MIPS SIMD technology and Baikal-T1 processor

Colleagues from Baikal Electronics offered me the chance to work with the Baikal-T1 processor [ L1 ] and write about my impressions. For them, this is a way to tell developers about the capabilities and features of their processor. For me, it is a chance to take a closer look at a system built on a modern processor core and, in the future, to reinvent fewer wheels when adding new functionality to, for example, the MIPSfpga-plus [ L2 ] project. And, of course, there is the usual engineering curiosity.

Today we will discuss MIPS SIMD, the vector extension of the MIPS architecture, which is available in MIPS Warrior P-class P5600 [ L3 ] cores and therefore also in the Baikal-T1 processor. The article is aimed at novice developers.

Introduction

Developing a device (hardware, software, or a combined hardware and software system) almost always involves processing digital and analog signals. The input may include sensor readings, signals from I/O devices, data from a file on disk, and so on. The output may be an image on a monitor, sound from speakers, drive control signals, indicator readings on a dashboard, and so on. Between input and output lies a set of mathematical operations.

If we briefly list the ways to implement this "mathematics in hardware", we get the following list of tools available to the developer, which can be used together or separately:

implementation as an analog circuit
software implementation on a microcontroller
FPGA implementation
hardware and software implementation as a system on a chip
use of a digital signal processor
software implementation on a general-purpose processor
use of a graphics controller

The same list, but in more detail

implementation as an analog circuit
Despite the dominance of digital devices, the world and the human senses remain analog. Like it or not, even when information is processed "exclusively" in the digital domain, we still have to place a filter in front of the ADC input. Similarly, integrators, differentiators, adders, and so on can be implemented with analog components. Over the decades when analog electronics dominated, engineers accumulated vast experience in solving all kinds of problems, and a good developer (even of digital electronics) takes this legacy into account [ D1 ];
software implementation on a microcontroller
Are there few signals to process, and is the mathematics neither complicated nor resource-hungry? Then a relatively inexpensive microcontroller with its built-in ADC, low clock frequency, and power-saving features is a perfectly good option. If necessary, the bottlenecks can be written in assembly;
FPGA implementation
If we have high requirements for speed, parallelism, and scalability of the solution, we describe the mathematics as a Verilog or VHDL module and choose an FPGA that can run at the frequency required for processing. If the solution turns out to be very successful and mass production makes sense, welcome to the world of ASICs [ L3 ];
hardware and software implementation as a system on a chip
Is the system too complex to describe entirely in Verilog, would you like to program some of the logic in a high-level language, and manage everything from Linux on top of that? Then the solution is a system on a chip (SoC): we take a ready-made processor core (Nios II, MIPSfpga, etc.) and surround it with the necessary peripheral modules, among them our own special module that performs the clever mathematics. Some operations can be exposed as processor instructions [ L4 ]. And yes, in the long run this can also be implemented as an ASIC;
use of a digital signal processor (DSP)
Here we are, in effect, buying a ready-made chip with a processor core, its own set of peripherals, and an instruction set aimed specifically at high-speed digital signal processing. We build our solution around it [ L10 , L5 ];
software implementation on a general-purpose processor
Each processor manufacturer offers its own architectural solutions for speeding up certain mathematical operations. The software developer's task is, when necessary, to use these capabilities to accelerate the calculations. This is exactly what is discussed below for MIPS processors;
using a graphics controller for computing
To make this list more complete, we must also mention the possibility of offloading the most complex and resource-intensive calculations to the video card [ L6 , L7 ].

There are no perfect tools. The best tool is the one that is guaranteed to solve the problem in a reasonable time, that the project team has the competence to use, and that is either already available or can be acquired at minimal cost. Budget constraints, customer requirements, and sometimes political considerations also influence such decisions.

I hope that this article will be useful as an introduction for readers who need to optimize the performance of certain calculations on the Baikal-T1 processor or on another processor based on MIPS cores in which MIPS SIMD technology is available.

Computation resource intensity

Before proceeding, consider one of the most common digital signal processing (DSP) tasks: filtering. As an example, take a finite impulse response (FIR) filter [ L8 ]. Without going into DSP theory and mathematical derivations, we note the main thing, the equation that describes this type of digital filter:

$$y(n) = b_0 x(n) + b_1 x(n-1) + \dots + b_P x(n-P)$$

where x(n) is the input signal, y(n) is the output signal, P is the order of the filter, and b_i are the filter coefficients. The same equation can be written as follows:

$$y(n) = \sum_{i=0}^{P} b_i \, x(n-i)$$

The nature of the input signal x(n) is ignored here. Let it be data obtained from an ADC, though it could just as well be read from a file; for us it does not matter. Since this article is about calculations rather than about DSP, we will not dive into the magic of filters but will simply use one of the online services to calculate the coefficients [ L9 ]:

Set the desired filter parameters (for example, one of the predefined options: band stop, i.e. a notch filter [ L11 ]) and click the Design Filter button:

The result of the calculation is the frequency response of the filter [ L12 ]:

Filter frequency response

Coefficient set and source code:

SampleFilter.h

 #ifndef SAMPLEFILTER_H_
 #define SAMPLEFILTER_H_

 /*

 FIR filter designed with
 http://t-filter.appspot.com

 sampling frequency: 2000 Hz

 * 0 Hz - 200 Hz
   gain = 1
   desired ripple = 5 dB
   actual ripple = 3.1077303934211127 dB

 * 300 Hz - 500 Hz
   gain = 0
   desired attenuation = -40 dB
   actual attenuation = -42.49314043914754 dB

 * 600 Hz - 1000 Hz
   gain = 1
   desired ripple = 5 dB
   actual ripple = 3.1077303934211127 dB

 */

 #define SAMPLEFILTER_TAP_NUM 25

 typedef struct {
   double history[SAMPLEFILTER_TAP_NUM];
   unsigned int last_index;
 } SampleFilter;

 void SampleFilter_init(SampleFilter* f);
 void SampleFilter_put(SampleFilter* f, double input);
 double SampleFilter_get(SampleFilter* f);

 #endif

SampleFilter.c

 #include "SampleFilter.h"

 static double filter_taps[SAMPLEFILTER_TAP_NUM] = {
   0.037391727827352596,
   -0.03299884552335979,
   0.044230583967321345,
   0.0023050970833628304,
   -0.06768087195950104,
   -0.046347105409124706,
   -0.011717387509232432,
   -0.0707342284185183,
   -0.049766517282999544,
   0.16086413543836361,
   0.21561058688743148,
   -0.10159456907827959,
   0.6638637561392535,
   -0.10159456907827959,
   0.21561058688743148,
   0.16086413543836361,
   -0.049766517282999544,
   -0.0707342284185183,
   -0.011717387509232432,
   -0.046347105409124706,
   -0.06768087195950104,
   0.0023050970833628304,
   0.044230583967321345,
   -0.03299884552335979,
   0.037391727827352596
 };

 void SampleFilter_init(SampleFilter* f) {
   int i;
   for(i = 0; i < SAMPLEFILTER_TAP_NUM; ++i)
     f->history[i] = 0;
   f->last_index = 0;
 }

 void SampleFilter_put(SampleFilter* f, double input) {
   f->history[f->last_index++] = input;
   if(f->last_index == SAMPLEFILTER_TAP_NUM)
     f->last_index = 0;
 }

 double SampleFilter_get(SampleFilter* f) {
   double acc = 0;
   int index = f->last_index, i;
   for(i = 0; i < SAMPLEFILTER_TAP_NUM; ++i) {
     index = index != 0 ? index-1 : SAMPLEFILTER_TAP_NUM-1;
     acc += f->history[index] * filter_taps[i];
   }
   return acc;
 }

Let's look at the filter parameters and the SampleFilter_get function, recall the FIR filter equation above, and note the points that matter most to us:

to obtain one output sample y(n), which in our case corresponds to one call of the SampleFilter_get function, we have to iterate over 25 input samples x(n) in a loop (the number of filter taps; see the SAMPLEFILTER_TAP_NUM macro and "taps" in the screenshot of the interface);
if we want this processing done on the fly (for example, as x(n) arrives from the ADC), we must complete it within one sampling period; at our sampling frequency of 2000 Hz that is 0.5 ms;
for each of these 25 samples x(n) we perform a series of mathematical operations: multiplication by the corresponding coefficient and addition to the accumulator. In addition, do not forget the accompanying memory reads and writes, which also take time.

Now suppose that the conditions of the task were clarified due to some objective reasons:

As a result, we get the following frequency response:

Filter frequency response

What matters to us:

because the sampling frequency increased to 4000 Hz, the time available for computation is halved if we want results on the fly;
the number of iterations (filter taps) almost tripled (from 25 to 67). Even if the ripple requirements are left unchanged (ripple / att. = 5 dB), the number of iterations still doubles;
thus, a slight change in the filter parameters increased the computational load about 6 times (or 4 times with the ripple requirements unchanged).

I would not want the reader to focus on the absolute values of the filter parameters in this example. The main idea is that at some point the computational load may grow beyond what was previously predicted. After a small change in the algorithm, its parameters, or the input data, it may suddenly turn out that your processor core does nothing but shovel data, and even then cannot keep up, never mind the remaining tasks. Or it copes, but the resulting load or the duration of the calculations no longer meets the system requirements. At that moment the task of optimization becomes more pressing than ever.

Computation speed

Having seen the problem in a simple example, let's look at how it is solved. Processor manufacturers use a number of techniques to increase computation speed: they raise the clock frequency, increase the number of processor cores, add new instructions, experiment with the pipeline configuration and cache size, and move to ever faster buses and interfaces.

In addition, the developer is always free to implement the worst bottlenecks in assembly, experiment with compiler options, or replace the algorithm with one that is less resource-intensive or behaves more predictably over the same range of parameters.

We will concentrate on two ways of increasing processing speed that cannot be used without support from the processor architecture:

combining several arithmetic operations in one instruction;
implementing the SIMD approach.

Combining arithmetic operations

Let's return to our filter and look carefully at the code of the SampleFilter_get function:

 double SampleFilter_get(SampleFilter* f) {
   double acc = 0;
   int index = f->last_index, i;
   for(i = 0; i < SAMPLEFILTER_TAP_NUM; ++i) {
     index = index != 0 ? index-1 : SAMPLEFILTER_TAP_NUM-1;
     acc += f->history[index] * filter_taps[i];
   }
   return acc;
 }

In particular, at this line:

 acc += f->history[index] * filter_taps[i];

Here we see two operations performed one after the other: multiplication by a coefficient and accumulation of the result in an accumulator variable. This combination of multiplication and addition is very common in DSP algorithms. If these two operations so often appear side by side, why not combine them into a single instruction that executes in one clock cycle? Engineers came up with this idea long ago. That is how the multiply-accumulate (MAC) instruction appeared, which is now present in all digital signal processors:

SIMD approach

Let's approach the solution from the other side. Instead of working with each x(n) separately, why not combine several samples into one vector (array) and apply an instruction to the whole vector at once or, more precisely, to every element of the vector simultaneously? In this case several samples can be processed in one cycle (not counting memory accesses):

The larger the maximum vector size, the higher the processing speed. This principle of organizing computation is called SIMD (single instruction, multiple data) [ L13 ]. It underlies vector processors [ L14 ] and the vector extensions of scalar processors: SSE and AVX for the x86 architecture, MIPS SIMD [ L15 ] for the MIPS architecture.

MIPS SIMD

Now that we understand the basic principles on which vector extensions of an architecture are built, we can move on to MIPS SIMD itself. An exhaustive description of this extension is given in the documentation [ D2 ]; here we note the main points:

it is available in two versions, MIPS32 SIMD and MIPS64 SIMD. Since Baikal-T1 is based on MIPS32 r5 cores, in what follows MIPS SIMD means MIPS32 SIMD. On the Imagination Technologies website and in its documentation the extension is also called MSA (MIPS SIMD Architecture);
it operates on 32 128-bit registers for vector data, each of which can be viewed as a vector of 16 8-bit, 8 16-bit, 4 32-bit, or 2 64-bit elements;
it includes more than 150 instructions for processing integer, floating-point, and fixed-point data, including bit operations, comparisons, and conversions:

Integer operations

Mnemonic	Instruction Description
ADDV, ADDVI	Add
ADD_A, ADDS_A	Add and Saturated Add Absolute Values
ADDS_S, ADDS_U	Signed and Unsigned Saturated Add
HADD_S, HADD_U	Signed and Unsigned Horizontal Add
ASUB_S, ASUB_U	Absolute Value of Signed and Unsigned Subtract
AVE_S, AVE_U	Signed and Unsigned Average
AVER_S, AVER_U	Signed and Unsigned Average with Rounding
DOTP_S, DOTP_U	Signed and Unsigned Dot Product
DPADD_S, DPADD_U	Signed and Unsigned Dot Product Add
DPSUB_S, DPSUB_U	Signed and Unsigned Dot Product Subtract
DIV_S, DIV_U	Signed and Unsigned Divide
MADDV	Multiply-add
MAX_A, MIN_A	Maximum and Minimum of Absolute Values
MAX_S, MAXI_S, MAX_U, MAXI_U	Signed and Unsigned Maximum
MIN_S, MINI_S, MIN_U, MINI_U	Signed and Unsigned Minimum
MSUBV	Multiply-subtract
MULV	Multiply
MOD_S, MOD_U	Signed and Unsigned Remainder (Modulo)
SAT_S, SAT_U	Signed and Unsigned Saturate
SUBS_S, SUBS_U	Signed and Unsigned Saturated Subtract
HSUB_S, HSUB_U	Signed and Unsigned Horizontal Subtract
SUBSUU_S	Signed Saturated Unsigned Subtract
SUBSUS_U	Unsigned Saturated Signed Subtract from Unsigned
SUBV, SUBVI	Subtract

Bit operations

Mnemonic	Instruction Description
AND, ANDI	Logical and
BCLR, BCLRI	Bit clear
BINSL, BINSLI, BINSR, BINSRI	Bit Insert Left and Right
BMNZ, BMNZI	Bit Move If Not Zero
BMZ, BMZI	Bit Move If Zero
BNEG, BNEGI	Bit negate
BSEL, BSELI	Bit select
BSET, BSETI	Bit set
NLOC	Leading One Bits Count
NLZC	Leading Zero Bits Count
NOR, NORI	Logical Negated Or
PCNT	Population (Bits Set to 1) Count
OR, ORI	Logical or
SLL, SLLI	Shift left
SRA, SRAI	Shift Right Arithmetic
SRAR, SRARI	Rounding Shift Right Arithmetic
SRL, SRLI	Shift right logical
SRLR, SRLRI	Rounding Shift Right Logical
XOR, XORI	Logical Exclusive Or

Floating point arithmetic

Mnemonic	Instruction Description
FADD	Floating-point addition
FDIV	Floating-point division
FEXP2	Floating point base 2 exponentiation
FLOG2	Floating point base 2 logarithm
FMADD, FMSUB	Floating-Point Fused Multiply-Add and Multiply-Subtract
FMAX, FMIN	Floating-Point Maximum and Minimum
FMAX_A, FMIN_A	Floating-Point Maximum and Minimum of Absolute Values
FMUL	Floating-point multiplication
FRCP	Approximate Floating-Point Reciprocal
FRINT	Floating point round to integer
FRSQRT	Approximate Floating-Point Reciprocal of Square Root
FSQRT	Floating point square root
FSUB	Floating point subtraction

Floating-point non-arithmetic operations

Mnemonic	Instruction Description
FCLASS	Floating point class mask

Floating point comparisons

Mnemonic	Instruction Description
FCAF	Floating-Point Quiet Compare Always False
FCUN	Floating-Point Quiet Compare Unordered
FCOR	Floating-Point Quiet Compare Ordered
FCEQ	Floating-Point Quiet Compare Equal
FCUNE	Floating-Point Quiet Compare Unordered or Not Equal
FCUEQ	Floating-Point Quiet Compare Unordered or Equal
FCNE	Floating-Point Quiet Compare Not Equal
FCLT	Floating Point Quiet Compare Less Than
FCULT	Floating-Point Quiet Compare Unordered or Less Than
FCLE	Floating-Point Quiet Compare Less Than or Equal
FCULE	Floating-Point Quiet Compare Unordered or Less Than or Equal
FSAF	Floating-Point Signaling Compare Always False
FSUN	Floating-Point Signaling Compare Unordered
FSOR	Floating-Point Signaling Compare Ordered
FSEQ	Floating-Point Signaling Compare Equal
FSUNE	Floating-Point Signaling Compare Unordered or Not Equal
FSUEQ	Floating-Point Signaling Compare Unordered or Equal
FSNE	Floating-Point Signaling Compare Not Equal
FSLT	Floating-Point Signaling Compare Less Than
FSULT	Floating-Point Signaling Compare Unordered or Less Than
FSLE	Floating-Point Signaling Compare Less Than or Equal
FSULE	Floating-Point Signaling Compare Unordered or Less Than or Equal

Floating-point conversion operations

Mnemonic	Instruction Description
FEXDO	Floating Point Down-Convert Interchange Format
FEXUPL, FEXUPR	Left-Half and Right-Half Floating-Point Up-Convert Interchange Format
FFINT_S, FFINT_U	Floating-Point Convert from Signed and Unsigned Integer
FFQL, FFQR	Left-Half and Right-Half Floating-Point Convert from Fixed-Point
FTINT_S, FTINT_U	Floating-Point Round and Convert to Signed and Unsigned Integer
FTRUNC_S, FTRUNC_U	Truncate and Convert to Signed and Unsigned Integer
FTQ	Floating Point Round and Convert to Fixed Point

Fixed point operations

Mnemonic	Instruction Description
MADD_Q, MADDR_Q	Fixed-Point Multiply and Add without and with Rounding
MSUB_Q, MSUBR_Q	Fixed-Point Multiply and Subtract without and with Rounding
MUL_Q, MULR_Q	Fixed-Point Multiply without and with Rounding

Branch operations

Mnemonic	Instruction Description
BNZ	Branch If Not Zero
BZ	Branch If Zero
CEQ, CEQI	Compare Equal
CLE_S, CLEI_S, CLE_U, CLEI_U	Compare Less-Than-or-Equal Signed and Unsigned
CLT_S, CLTI_S, CLT_U, CLTI_U	Compare Less-Than Signed and Unsigned

Vector load, store, and move operations

Mnemonic	Instruction Description
CFCMSA, CTCMSA	Copy from and Copy to MSA Control Register
LD	Load Vector
LDI	Load Immediate
MOVE	Vector to Vector Move
SPLAT, SPLATI	Replicate Vector Element
FILL	Fill Vector from GPR
INSERT, INSVE	Insert GPR and Vector element 0 to Vector Element
COPY_S, COPY_U	Copy element to GPR Signed and Unsigned
ST	Store Vector

Vector element permutation operations

Mnemonic	Instruction Description
ILVEV, ILVOD	Interleave Even and Odd Elements
ILVL, ILVR	Interleave Left and Right
PCKEV, PCKOD	Pack Even and Odd Elements
SHF	Set shuffle
SLD, SLDI	Element slide
VSHF	Vector shuffle

Other operations

Mnemonic	Instruction Description
LSA	Left-shift add or load / store address calculation

Most instructions are available in several variants that differ in the size of the vector element and the type of data processed. For example, the multiply-accumulate operation already familiar to us is implemented as FMADD for floating-point numbers, MADD_Q for fixed-point numbers (with saturation), MADDR_Q for fixed-point numbers (with saturation and rounding), and MADDV for integers. MADDV, in turn, is available in four variants: MADDV.B, where each vector element is a byte (8 bits); MADDV.H, where each element is a halfword (16 bits); MADDV.W, a word (32 bits); and MADDV.D, a doubleword (64 bits).

Formal command description

Compiler support

MIPS SIMD is supported by the gcc compiler, but this support has its own peculiarities:

separate data types have been introduced for vectors with elements of different sizes [ L16 ];
we are spared the need to work with registers directly (work is done at the level of variables of the appropriate type);
high-level aliases (built-in functions) have been introduced for all of the operations listed above [ L17 ];
according to the gcc documentation, the compiler does not decide on its own to use MIPS SIMD; the same is confirmed by a thoroughly analyzed example of MIPS SIMD usage published on the Imagination Technologies website [ D3 ]. So, to draw a rough analogy, working with MIPS SIMD is closer to writing inline assembly than to high-level programming. This should be kept in mind when making optimization decisions.

Code before optimization

 #define ROUND_POWER_OF_TWO(value, n) (((value) + (1 << ((n) - 1))) >> (n))

 static inline unsigned char clip_pixel(int i32Val)
 {
     return ((i32Val) > 255) ? 255u : ((i32Val) < 0) ? 0u : (i32Val);
 }

 void vert_filter_8taps_16width_c(unsigned char *pSrc,  // SOURCE POINTER
                                  int SrcStride,        // SOURCE BUFFER PITCH
                                  unsigned char *pDst,  // DEST POINTER
                                  int DstStride,        // DEST BUFFER PITCH
                                  char *pFilter,        // POINTER TO FILTER BANK
                                  int Height)           // HEIGHT OF THE BLOCK
 {
     unsigned int Row, Col;
     int FiltSum;
     short Src0, Src1, Src2, Src3, Src4, Src5, Src6, Src7;

     pSrc -= (8 / 2 - 1) * SrcStride; // MOVE INPUT SRC POINTER TO APPROPRIATE POSITION

     // LOOP FOR NUMBER OF COLUMNS-16
     for (Col = 0; Col < 16; ++Col)
     {
         Src0 = pSrc[0 * SrcStride];
         Src1 = pSrc[1 * SrcStride];
         Src2 = pSrc[2 * SrcStride];
         Src3 = pSrc[3 * SrcStride];
         Src4 = pSrc[4 * SrcStride];
         Src5 = pSrc[5 * SrcStride];
         Src6 = pSrc[6 * SrcStride];

         // LOOP FOR NUMBER OF ROWS
         for (Row = 0; Row < Height; Row++)
         {
             Src7 = pSrc[(7 + Row) * SrcStride];

             FiltSum = 0;
             // ACCUMULATED FILTER SUM += PIXEL * FILTER COEFF
             FiltSum += (Src0 * pFilter[0]);
             FiltSum += (Src1 * pFilter[1]);
             FiltSum += (Src2 * pFilter[2]);
             FiltSum += (Src3 * pFilter[3]);
             FiltSum += (Src4 * pFilter[4]);
             FiltSum += (Src5 * pFilter[5]);
             FiltSum += (Src6 * pFilter[6]);
             FiltSum += (Src7 * pFilter[7]);

             FiltSum = ROUND_POWER_OF_TWO(FiltSum, 7);    // ROUNDING
             pDst[Row * DstStride] = clip_pixel(FiltSum); // CLIP RESULT IN 0-255(UNSIGNED CHAR)

             // PREPARING FOR NEXT CONVOLUTION- SLIDING WINDOW
             Src0 = Src1;
             Src1 = Src2;
             Src2 = Src3;
             Src3 = Src4;
             Src4 = Src5;
             Src5 = Src6;
             Src6 = Src7;
         }

         pSrc += 1;
         pDst += 1;
     }
 }

Code after optimization using MIPS SIMD

 /* MSA VECTOR TYPES */
 #define WRLEN 128               // VECTOR REGISTER LENGTH 128-BIT
 #define NUMWRELEM (WRLEN >> 3)

 typedef signed char IMG_VINT8 __attribute__ ((vector_size(NUMWRELEM)));    //VEC SIGNED BYTES
 typedef unsigned char IMG_VUINT8 __attribute__ ((vector_size(NUMWRELEM))); //VEC UNSIGNED BYTES
 typedef short IMG_VINT16 __attribute__ ((vector_size(NUMWRELEM)));         //VEC SIGNED HALF-WORD

 #define LOAD_UNPACK_VEC(pSrc, SrcStride, vi16VecRight, vi16VecLeft)  \
 {                                                                    \
     IMG_VUINT8 vu8Src;                                               \
     IMG_VINT16 vi16Vec0;                                             \
     IMG_VINT8 vi8Tmp0;                                               \
     /* LOAD INPUT VECTOR */                                          \
     vu8Src = *((IMG_VINT8 *)(pSrc));                                 \
     /* RANGE WARPING TO MAINTAIN 16 BIT PRECISION */                 \
     vi16Vec0 = __builtin_msa_xori_b(vu8Src, 128);                    \
     /* CALCULATE SIGN EXTENSION */                                   \
     vi8Tmp0 = __builtin_msa_clti_s_b(vi16Vec0, 0);                   \
     /* INTERLEAVE RIGHT TO 16 BIT VEC */                             \
     vi16VecRight = __builtin_msa_ilvr_b(vi8Tmp0, vi16Vec0);          \
     /* INTERLEAVE LEFT TO 16 BIT VEC */                              \
     vi16VecLeft = __builtin_msa_ilvl_b(vi8Tmp0, vi16Vec0);           \
     pSrc += SrcStride;                                               \
 }

 void vert_filter_8taps_16width_msa(unsigned char *pSrc,  // SOURCE POINTER
                                    int SrcStride,        // SOURCE BUFFER PITCH
                                    unsigned char *pDst,  // DEST POINTER
                                    int DstStride,        // DEST BUFFER PITCH
                                    char *pFilter,        // POINTER TO FILTER BANK
                                    int Height)           // HEIGHT OF THE BLOCK
 {
     int u32LoopCnt;
     IMG_VINT16 vi16Vec0Right, vi16Vec1Right, vi16Vec2Right, vi16Vec3Right;
     IMG_VINT16 vi16Vec4Right, vi16Vec5Right, vi16Vec6Right, vi16Vec7Right;
     IMG_VINT16 vi16Vec0Left, vi16Vec1Left, vi16Vec2Left, vi16Vec3Left;
     IMG_VINT16 vi16Vec4Left, vi16Vec5Left, vi16Vec6Left, vi16Vec7Left;
     IMG_VINT16 vi16Temp1Right, vi16Temp1Left;
     IMG_VINT16 vi16Filt0, vi16Filt1, vi16Filt2, vi16Filt3;
     IMG_VINT16 vi16Filt4, vi16Filt5, vi16Filt6, vi16Filt7;

     pSrc -= (3 * SrcStride);

     // PREPARE FILTER COEFF IN VEC REGISTERS
     vi16Filt0 = __builtin_msa_fill_h(*(pFilter));
     vi16Filt1 = __builtin_msa_fill_h(*(pFilter + 1));
     vi16Filt2 = __builtin_msa_fill_h(*(pFilter + 2));
     vi16Filt3 = __builtin_msa_fill_h(*(pFilter + 3));
     vi16Filt4 = __builtin_msa_fill_h(*(pFilter + 4));
     vi16Filt5 = __builtin_msa_fill_h(*(pFilter + 5));
     vi16Filt6 = __builtin_msa_fill_h(*(pFilter + 6));
     vi16Filt7 = __builtin_msa_fill_h(*(pFilter + 7));

     //LOAD 7 INPUT VECTORS
     LOAD_UNPACK_VEC(pSrc, SrcStride, vi16Vec0Right, vi16Vec0Left)
     LOAD_UNPACK_VEC(pSrc, SrcStride, vi16Vec1Right, vi16Vec1Left)
     LOAD_UNPACK_VEC(pSrc, SrcStride, vi16Vec2Right, vi16Vec2Left)
     LOAD_UNPACK_VEC(pSrc, SrcStride, vi16Vec3Right, vi16Vec3Left)
     LOAD_UNPACK_VEC(pSrc, SrcStride, vi16Vec4Right, vi16Vec4Left)
     LOAD_UNPACK_VEC(pSrc, SrcStride, vi16Vec5Right, vi16Vec5Left)
     LOAD_UNPACK_VEC(pSrc, SrcStride, vi16Vec6Right, vi16Vec6Left)

     // START CONVOLUTION VERTICALLY
     for (u32LoopCnt = Height; u32LoopCnt--; )
     {
         //LOAD 8TH INPUT VECTOR
         LOAD_UNPACK_VEC(pSrc, SrcStride, vi16Vec7Right, vi16Vec7Left)

         /* FILTER CALC */
         IMG_VINT16 vi16Tmp1, vi16Tmp2;
         IMG_VINT8 vi8Tmp3;

         // 8 TAP VECTORIZED CONVOLUTION FOR RIGHT HALF
         vi16Tmp1 = (vi16Vec0Right * vi16Filt0);
         vi16Tmp1 += (vi16Vec1Right * vi16Filt1);
         vi16Tmp1 += (vi16Vec2Right * vi16Filt2);
         vi16Tmp1 += (vi16Vec3Right * vi16Filt3);
         vi16Tmp2 = (vi16Vec4Right * vi16Filt4);
         vi16Tmp2 += (vi16Vec5Right * vi16Filt5);
         vi16Tmp2 += (vi16Vec6Right * vi16Filt6);
         vi16Tmp2 += (vi16Vec7Right * vi16Filt7);
         vi16Temp1Right = __builtin_msa_adds_s_h(vi16Tmp1, vi16Tmp2);

         // 8 TAP VECTORIZED CONVOLUTION FOR LEFT HALF
         vi16Tmp1 = (vi16Vec0Left * vi16Filt0);
         vi16Tmp1 += (vi16Vec1Left * vi16Filt1);
         vi16Tmp1 += (vi16Vec2Left * vi16Filt2);
         vi16Tmp1 += (vi16Vec3Left * vi16Filt3);
         vi16Tmp2 = (vi16Vec4Left * vi16Filt4);
         vi16Tmp2 += (vi16Vec5Left * vi16Filt5);
         vi16Tmp2 += (vi16Vec6Left * vi16Filt6);
         vi16Tmp2 += (vi16Vec7Left * vi16Filt7);
         vi16Temp1Left = __builtin_msa_adds_s_h(vi16Tmp1, vi16Tmp2);

         // ROUNDING RIGHT SHIFT RANGE CLIPPING AND NARROWING
         vi16Temp1Right = __builtin_msa_srari_h(vi16Temp1Right, 7);
         vi16Temp1Right = __builtin_msa_sat_s_h(vi16Temp1Right, 7);
         vi16Temp1Left = __builtin_msa_srari_h(vi16Temp1Left, 7);
         vi16Temp1Left = __builtin_msa_sat_s_h(vi16Temp1Left, 7);
         vi8Tmp3 = __builtin_msa_pckev_b(vi16Temp1Left, vi16Temp1Right);
         vi8Tmp3 = __builtin_msa_xori_b(vi8Tmp3, 128);

         // STORE OUTPUT VEC
         *((IMG_VINT8 *)(pDst)) = (vi8Tmp3);
         pDst += DstStride;

         // PREPARING FOR NEXT CONVOLUTION- SLIDING WINDOW
         vi16Vec0Right = vi16Vec1Right;
         vi16Vec1Right = vi16Vec2Right;
         vi16Vec2Right = vi16Vec3Right;
         vi16Vec3Right = vi16Vec4Right;
         vi16Vec4Right = vi16Vec5Right;
         vi16Vec5Right = vi16Vec6Right;
         vi16Vec6Right = vi16Vec7Right;
         vi16Vec0Left = vi16Vec1Left;
         vi16Vec1Left = vi16Vec2Left;
         vi16Vec2Left = vi16Vec3Left;
         vi16Vec3Left = vi16Vec4Left;
         vi16Vec4Left = vi16Vec5Left;
         vi16Vec5Left = vi16Vec6Left;
         vi16Vec6Left = vi16Vec7Left;
     }
 }

Performance evaluation

Initially I planned to write a very simple application (a synthetic test) to assess the performance gain from MIPS SIMD. Attractive as it is, this option is not representative, because it is detached from the user's real tasks. Fortunately, employees of Imagination Technologies and MIPS have made a significant contribution to ffmpeg [ L18 ], a widely used open source application for converting audio and video [ L19 ]. I believe that they know better than anyone how to use the technology in question properly, so this code should be as efficient as possible.

Thus, if we compile ffmpeg in two versions: with and without MIPS SIMD support, we can compare the speed of work on the same input data and draw some conclusion about the effectiveness of vector calculations based on the results.

Build ffmpeg

The build is performed on an x86 machine running Linux, using the development tools from Imagination Technologies [ L20 ] in cross-compilation mode. The tests use the latest stable release at the time of writing, ffmpeg 3.3 [ L19 ].

ffmpeg configuration for the build with MIPS SIMD support:

 ./configure --enable-cross-compile --prefix=../ffmpeg-msa --cross-prefix=/home/stas/mipsfpga/toolchain/mips-mti-linux-gnu/2016.05-03/bin/mips-mti-linux-gnu- --arch=mips --cpu=p5600 --target-os=linux --extra-cflags="-EL -static" --extra-ldflags="-EL -static" --disable-iconv

And with MIPS SIMD support disabled:

 ./configure --enable-cross-compile --prefix=../ffmpeg-soft --cross-prefix=/home/stas/mipsfpga/toolchain/mips-mti-linux-gnu/2016.05-03/bin/mips-mti-linux-gnu- --arch=mips --cpu=p5600 --target-os=linux --extra-cflags="-EL -static" --extra-ldflags="-EL -static" --disable-iconv --disable-msa

Description of configuration settings

Parameter	Description
--enable-cross-compile	the build is performed on a machine with a different architecture
--prefix=../ffmpeg-msa	the directory in which files will be placed by the make install command
--cross-prefix=../mips-mti-linux-gnu-	toolchain path
--arch=mips	target architecture: MIPS
--cpu=p5600	target processor core: p5600
--target-os=linux	target OS: Linux
--extra-cflags="-EL -static"	target system is little endian, use static linking
--extra-ldflags="-EL -static"	same as above
--disable-iconv	disable text encoding conversion functionality
--disable-msa	do not use MIPS SIMD

If you plan to repeat these steps, note that the ffmpeg 3.3 build with MIPS SIMD support fails with a minor error. To fix it, add the following line to the file libavcodec/mips/hevcpred_msa.c:

 #include "libavcodec/hevcdec.h"

Testing

Testing is performed on the Baikal-T1 processor:

 # uname -a
 Linux baikal-BFK-18446744073709551615 4.4.41-bfk #0 SMP Tue Apr 25 15:54:24 MSK 2017 mips GNU/Linux

The input data are two videos encoded with x264 [ L21 ] and x265 [ L22 ]. The test task is to decode each video while saving frames as images at regular intervals:

 ./ffmpeg-mips/ffmpeg-msa/bin/ffmpeg -i ./The\ Simpsons\ Movie\ -\ Trailer_x264.mp4 -vf fps=1/10 ./out_img/ffmpeg-msa_x264_%d.jpg -report -benchmark
 ./ffmpeg-mips/ffmpeg-msa/bin/ffmpeg -i ./Tears_400_x265.mp4 -vf fps=1 ./out_img/ffmpeg-msa_x265_%d.jpg -report -benchmark
 ./ffmpeg-mips/ffmpeg-soft/bin/ffmpeg -i ./The\ Simpsons\ Movie\ -\ Trailer_x264.mp4 -vf fps=1/10 ./out_img/ffmpeg-soft_x264_%d.jpg -report -benchmark
 ./ffmpeg-mips/ffmpeg-soft/bin/ffmpeg -i ./Tears_400_x265.mp4 -vf fps=1 ./out_img/ffmpeg-soft_x265_%d.jpg -report -benchmark

Description of launch options

Parameter	Description
-i ./Tears_400_x265.mp4	file to be processed
-vf fps=1	frame capture rate (fps=1, one frame per second, for the short video; fps=1/10, one frame every 10 seconds, for the long one)
./out_img/ffmpeg-soft_x264_%d.jpg	output file name pattern
-report	generate a report on the results of work
-benchmark	include performance data

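ffmpeg results

Scenario	Duration (sec)
x264 decoding with MIPS SIMD support	113
x265 decoding with MIPS SIMD support	22
x264 decoding without MIPS SIMD support	164
x265 decoding without MIPS SIMD support	52

Thus, using MIPS SIMD speeds up decoding by a factor of 1.5 to 2.4.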

The ffmpeg logs are available on GitHub [ L23 ].

x264 decoding with MIPS SIMD support (log excerpt)

 ffmpeg started on 2010-10-18 at 00:30:26
 Report written to "ffmpeg-20101018-003026.log"
 Command line:
 ./ffmpeg-mips/ffmpeg-msa/bin/ffmpeg -i "./The Simpsons Movie - Trailer_x264.mp4" -vf "fps=1/10" "./out_img/ffmpeg-msa_x264_%d.jpg" -report -benchmark
 ffmpeg version 3.3 Copyright (c) 2000-2017 the FFmpeg developers
   built with gcc 4.9.2 (Codescape GNU Tools 2016.05-03 for MIPS MTI Linux)
   configuration: --enable-cross-compile --prefix=../ffmpeg-msa --cross-prefix=/home/stas/mipsfpga/toolchain/mips-mti-linux-gnu/2016.05-03/bin/mips-mti-linux-gnu- --arch=mips --cpu=p5600 --target-os=linux --extra-cflags='-EL -static' --extra-ldflags='-EL -static' --disable-iconv
   libavutil 55. 58.100 / 55. 58.100
   libavcodec 57. 89.100 / 57. 89.100
   libavformat 57. 71.100 / 57. 71.100
   libavdevice 57. 6.100 / 57. 6.100
   libavfilter 6. 82.100 / 6. 82.100
   libswscale 4. 6.100 / 4. 6.100
   libswresample 2. 7.100 / 2. 7.100
 Splitting the commandline.
 Reading option '-i' ... matched as input url with argument './The Simpsons Movie - Trailer_x264.mp4'.
 Reading option '-vf' ... matched as option 'vf' (set video filters) with argument 'fps=1/10'.
 Reading option './out_img/ffmpeg-msa_x264_%d.jpg' ... matched as output url.
 Reading option '-report' ... matched as option 'report' (generate a report) with argument '1'.
 Reading option '-benchmark' ... matched as option 'benchmark' (add timings for benchmarking) with argument '1'.
 Finished splitting the commandline.
 Parsing a group of options: global .
 Applying option report (generate a report) with argument 1.
 Applying option benchmark (add timings for benchmarking) with argument 1.
 Successfully parsed a group of options.
 Parsing a group of options: input url ./The Simpsons Movie - Trailer_x264.mp4.
 Successfully parsed a group of options.
 Opening an input file: ./The Simpsons Movie - Trailer_x264.mp4.
 [file @ 0x1fce0e0] Setting default whitelist 'file,crypto'
 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] Format mov,mp4,m4a,3gp,3g2,mj2 probed with size=2048 and score=100
 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] ISO: File Type Major Brand: isom
 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] Unknown dref type 0x206c7275 size 12
 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] Unknown dref type 0x206c7275 size 12
 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] Before avformat_find_stream_info() pos: 73516232 bytes read:65587 seeks:1 nb_streams:2
 [h264 @ 0x1fcecb0] nal_unit_type: 7, nal_ref_idc: 3
 [h264 @ 0x1fcecb0] nal_unit_type: 8, nal_ref_idc: 3
 [h264 @ 0x1fcecb0] nal_unit_type: 6, nal_ref_idc: 0
 [h264 @ 0x1fcecb0] nal_unit_type: 5, nal_ref_idc: 3
 [h264 @ 0x1fcecb0] user data:"x264 - core 54 svn-620M - H.264/MPEG-4 AVC codec - Copyleft 2005 - http://www.videolan.org/x264.html - options: cabac=1 ref=5 deblock=1:0:0 analyse=0x1:0x131 me=umh subme=6 brdo=1 mixed_ref=0 me_range=16 chroma_me=1 trellis=1 8x8dct=0 cqm=0 deadzone=21,11 chroma_qp_offset=0 threads=1 nr=0 decimate=1 mbaff=0 bframes=1 b_pyramid=0 b_adapt=1 b_bias=0 direct=3 wpredb=0 bime=0 keyint=250 keyint_min=25 scenecut=40 rc=2pass bitrate=4214 ratetol=1.0 rceq='blurCplx^(1-qComp)' qcomp=0.60 qpmin=10 qpmax=51 qpstep=4 cplxblur=20.0 qblur=0.5 ip_ratio=1.40 pb_ratio=1.30"
 [h264 @ 0x1fcecb0] Reinit context to 1280x544, pix_fmt: yuv420p
 [h264 @ 0x1fcecb0] no picture
 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] All info found
 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] After avformat_find_stream_info() pos: 94845 bytes read:141348 seeks:2 frames:13
 Input #0, mov,mp4,m4a,3gp,3g2,mj2, from './The Simpsons Movie - Trailer_x264.mp4':
   Metadata:
     major_brand : isom
     minor_version : 1
     compatible_brands: isomavc1
     creation_time : 2007-02-19T05:03:04.000000Z
   Duration: 00:02:17.30, start: 0.000000, bitrate: 4283 kb/s
     Stream #0:0(und), 12, 1/24000: Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x544, 4221 kb/s, 23.98 fps, 23.98 tbr, 24k tbn, 47.95 tbc (default)
     Metadata:
creation_time : 2007-02-19T05:03:04.000000Z handler_name : GPAC ISO Video Handler Stream #0:1(und), 1, 1/48000: Audio: aac (HE-AAC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 64 kb/s (default) Metadata: creation_time : 2007-02-19T05:03:08.000000Z handler_name : GPAC ISO Audio Handler Successfully opened the file. Parsing a group of options: output url ./out_img/ffmpeg-msa_x264_%d.jpg. Applying option vf (set video filters) with argument fps=1/10. Successfully parsed a group of options. Opening an output file: ./out_img/ffmpeg-msa_x264_%d.jpg. Successfully opened the file. detected 2 logical cores [h264 @ 0x20191e0] nal_unit_type: 7, nal_ref_idc: 3 [h264 @ 0x20191e0] nal_unit_type: 8, nal_ref_idc: 3 Stream mapping: Stream #0:0 -> #0:0 (h264 (native) -> mjpeg (native)) Press [q] to stop, [?] for help cur_dts is invalid (this is harmless if it occurs once at the start per stream) cur_dts is invalid (this is harmless if it occurs once at the start per stream) [h264 @ 0x20191e0] nal_unit_type: 6, nal_ref_idc: 0 [h264 @ 0x20191e0] nal_unit_type: 5, nal_ref_idc: 3 [h264 @ 0x20191e0] user data:"x264 - core 54 svn-620M - H.264/MPEG-4 AVC codec - Copyleft 2005 - http://www.videolan.org/x264.html - options: cabac=1 ref=5 deblock=1:0:0 analyse=0x1:0x131 me=umh subme=6 brdo=1 mixed_ref=0 me_range=16 chroma_me=1 trellis=1 8x8dct=0 cqm=0 deadzone=21,11 chroma_qp_offset=0 threads=1 nr=0 decimate=1 mbaff=0 bframes=1 b_pyramid=0 b_adapt=1 b_bias=0 direct=3 wpredb=0 bime=0 keyint=250 keyint_min=25 scenecut=40 rc=2pass bitrate=4214 ratetol=1.0 rceq='blurCplx^(1-qComp)' qcomp=0.60 qpmin=10 qpmax=51 qpstep=4 cplxblur=20.0 qblur=0.5 ip_ratio=1.40 pb_ratio=1.30" [h264 @ 0x20191e0] Reinit context to 1280x544, pix_fmt: yuv420p [h264 @ 0x20191e0] no picture cur_dts is invalid (this is harmless if it occurs once at the start per stream) [h264 @ 0x2026050] nal_unit_type: 1, nal_ref_idc: 2 [h264 @ 0x2060f00] nal_unit_type: 1, nal_ref_idc: 0 cur_dts is invalid (this is harmless if it occurs 
once at the start per stream) [h264 @ 0x20191e0] nal_unit_type: 1, nal_ref_idc: 2 [Parsed_fps_0 @ 0x202bf50] Setting 'fps' to value '1/10' [Parsed_fps_0 @ 0x202bf50] fps=1/10 [graph 0 input from stream 0:0 @ 0x202c6f0] Setting 'video_size' to value '1280x544' [graph 0 input from stream 0:0 @ 0x202c6f0] Setting 'pix_fmt' to value '0' [graph 0 input from stream 0:0 @ 0x202c6f0] Setting 'time_base' to value '1/24000' [graph 0 input from stream 0:0 @ 0x202c6f0] Setting 'pixel_aspect' to value '0/1' [graph 0 input from stream 0:0 @ 0x202c6f0] Setting 'sws_param' to value 'flags=2' [graph 0 input from stream 0:0 @ 0x202c6f0] Setting 'frame_rate' to value '24000/1001' [graph 0 input from stream 0:0 @ 0x202c6f0] w:1280 h:544 pixfmt:yuv420p tb:1/24000 fr:24000/1001 sar:0/1 sws_param:flags=2 [format @ 0x202c030] compat: called with args=[yuvj420p|yuvj422p|yuvj444p] [format @ 0x202c030] Setting 'pix_fmts' to value 'yuvj420p|yuvj422p|yuvj444p' [auto_scaler_0 @ 0x202c660] Setting 'flags' to value 'bicubic' [auto_scaler_0 @ 0x202c660] w:iw h:ih flags:'bicubic' interl:0 [format @ 0x202c030] auto-inserting filter 'auto_scaler_0' between the filter 'Parsed_fps_0' and the filter 'format' [AVFilterGraph @ 0x202bb80] query_formats: 4 queried, 2 merged, 1 already done, 0 delayed [auto_scaler_0 @ 0x202c660] picking yuvj420p out of 3 ref:yuv420p alpha:0 [swscaler @ 0x202cf00] deprecated pixel format used, make sure you did set range correctly [auto_scaler_0 @ 0x202c660] w:1280 h:544 fmt:yuv420p sar:0/1 -> w:1280 h:544 fmt:yuvj420p sar:0/1 flags:0x4 [mjpeg @ 0x1ff5f90] Forcing thread count to 1 for MJPEG encoding, use -thread_type slice or a constant quantizer if you want to use multiple cpu cores [mjpeg @ 0x1ff5f90] intra_quant_bias = 96 inter_quant_bias = 0 Output #0, image2, to './out_img/ffmpeg-msa_x264_%d.jpg': Metadata: major_brand : isom minor_version : 1 compatible_brands: isomavc1 encoder : Lavf57.71.100 Stream #0:0(und), 0, 10/1: Video: mjpeg, yuvj420p(pc), 1280x544, q=2-31, 200 
kb/s, 0.10 fps, 0.10 tbn, 0.10 tbc (default) Metadata: creation_time : 2007-02-19T05:03:04.000000Z handler_name : GPAC ISO Video Handler encoder : Lavc57.89.100 mjpeg Side data: cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: -1 cur_dts is invalid (this is harmless if it occurs once at the start per stream) [Parsed_fps_0 @ 0x202bf50] Dropping 1 frame(s). cur_dts is invalid (this is harmless if it occurs once at the start per stream) ... [h264 @ 0x2060f00] nal_unit_type: 1, nal_ref_idc: 2 [Parsed_fps_0 @ 0x202bf50] Dropping 1 frame(s). [Parsed_fps_0 @ 0x202bf50] Dropping 1 frame(s). [Parsed_fps_0 @ 0x202bf50] Dropping 1 frame(s). [file @ 0x209dc40] Setting default whitelist 'file,crypto' [AVIOContext @ 0x2189ab0] Statistics: 0 seeks, 1 writeouts frame= 15 fps=0.2 q=1.6 size=N/A time=00:02:30.00 bitrate=N/A speed=2.03x No more output streams to write to, finishing. frame= 15 fps=0.2 q=1.6 Lsize=N/A time=00:02:30.00 bitrate=N/A speed=2.03x video:1382kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown Input file #0 (./The Simpsons Movie - Trailer_x264.mp4): Input stream #0:0 (video): 3288 packets read (72364468 bytes); 3288 frames decoded; Input stream #0:1 (audio): 1 packets read (134 bytes); Total: 3289 packets (72364602 bytes) demuxed Output file #0 (./out_img/ffmpeg-msa_x264_%d.jpg): Output stream #0:0 (video): 15 frames encoded; 15 packets muxed (1414925 bytes); Total: 15 packets (1414925 bytes) muxed bench: utime=113.070s 3288 frames successfully decoded, 0 decoding errors bench: maxrss=39264kB [Parsed_fps_0 @ 0x202bf50] 3288 frames in, 15 frames out; 3273 frames dropped, 0 frames duplicated. [AVIOContext @ 0x1fd6230] Statistics: 73517562 bytes read, 5 seeks

x265 with MIPS SIMD (log, abridged)

 ffmpeg started on 2010-10-18 at 00:27:58 Report written to "ffmpeg-20101018-002758.log" Command line: ./ffmpeg-mips/ffmpeg-msa/bin/ffmpeg -i ./Tears_400_x265.mp4 -vf "fps=1" "./out_img/ffmpeg-msa_x265_%d.jpg" -report -benchmark ffmpeg version 3.3 Copyright (c) 2000-2017 the FFmpeg developers built with gcc 4.9.2 (Codescape GNU Tools 2016.05-03 for MIPS MTI Linux) configuration: --enable-cross-compile --prefix=../ffmpeg-msa --cross-prefix=/home/stas/mipsfpga/toolchain/mips-mti-linux-gnu/2016.05-03/bin/mips-mti-linux-gnu- --arch=mips --cpu=p5600 --target-os=linux --extra-cflags='-EL -static' --extra-ldflags='-EL -static' --disable-iconv libavutil 55. 58.100 / 55. 58.100 libavcodec 57. 89.100 / 57. 89.100 libavformat 57. 71.100 / 57. 71.100 libavdevice 57. 6.100 / 57. 6.100 libavfilter 6. 82.100 / 6. 82.100 libswscale 4. 6.100 / 4. 6.100 libswresample 2. 7.100 / 2. 7.100 Splitting the commandline. Reading option '-i' ... matched as input url with argument './Tears_400_x265.mp4'. Reading option '-vf' ... matched as option 'vf' (set video filters) with argument 'fps=1'. Reading option './out_img/ffmpeg-msa_x265_%d.jpg' ... matched as output url. Reading option '-report' ... matched as option 'report' (generate a report) with argument '1'. Reading option '-benchmark' ... matched as option 'benchmark' (add timings for benchmarking) with argument '1'. Finished splitting the commandline. Parsing a group of options: global . Applying option report (generate a report) with argument 1. Applying option benchmark (add timings for benchmarking) with argument 1. Successfully parsed a group of options. Parsing a group of options: input url ./Tears_400_x265.mp4. Successfully parsed a group of options. Opening an input file: ./Tears_400_x265.mp4. 
[file @ 0x1fce0e0] Setting default whitelist 'file,crypto' [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] Format mov,mp4,m4a,3gp,3g2,mj2 probed with size=2048 and score=100 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] ISO: File Type Major Brand: iso4 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] Unknown dref type 0x206c7275 size 12 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] Before avformat_find_stream_info() pos: 705972 bytes read:32827 seeks:1 nb_streams:1 [hevc @ 0x1fceca0] nal_unit_type: 32(VPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fceca0] Decoding VPS [hevc @ 0x1fceca0] Main profile bitstream [hevc @ 0x1fceca0] nal_unit_type: 33(SPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fceca0] Decoding SPS [hevc @ 0x1fceca0] Main profile bitstream [hevc @ 0x1fceca0] Decoding VUI [hevc @ 0x1fceca0] nal_unit_type: 34(PPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fceca0] Decoding PPS [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] All info found [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1fcd980] After avformat_find_stream_info() pos: 20299 bytes read:65595 seeks:2 frames:1 Input #0, mov,mp4,m4a,3gp,3g2,mj2, from './Tears_400_x265.mp4': Metadata: major_brand : iso4 minor_version : 1 compatible_brands: iso4hvc1 creation_time : 2014-08-25T18:10:46.000000Z Duration: 00:00:13.96, start: 0.125000, bitrate: 404 kb/s Stream #0:0(und), 1, 1/24000: Video: hevc (Main) (hvc1 / 0x31637668), yuv420p(tv), 1920x800, 402 kb/s, 24 fps, 24 tbr, 24k tbn, 24 tbc (default) Metadata: creation_time : 2014-08-25T18:10:46.000000Z handler_name : hevc:fps=24@GPAC0.5.1-DEV-rev4807 Successfully opened the file. Parsing a group of options: output url ./out_img/ffmpeg-msa_x265_%d.jpg. Applying option vf (set video filters) with argument fps=1. Successfully parsed a group of options. Opening an output file: ./out_img/ffmpeg-msa_x265_%d.jpg. Successfully opened the file. 
detected 2 logical cores [hevc @ 0x1fe5a00] nal_unit_type: 32(VPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fe5a00] Decoding VPS [hevc @ 0x1fe5a00] Main profile bitstream [hevc @ 0x1fe5a00] nal_unit_type: 33(SPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fe5a00] Decoding SPS [hevc @ 0x1fe5a00] Main profile bitstream [hevc @ 0x1fe5a00] Decoding VUI [hevc @ 0x1fe5a00] nal_unit_type: 34(PPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fe5a00] Decoding PPS Stream mapping: Stream #0:0 -> #0:0 (hevc (native) -> mjpeg (native)) Press [q] to stop, [?] for help cur_dts is invalid (this is harmless if it occurs once at the start per stream) cur_dts is invalid (this is harmless if it occurs once at the start per stream) [hevc @ 0x1fe5a00] nal_unit_type: 39(SEI_PREFIX), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fe5a00] nal_unit_type: 39(SEI_PREFIX), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fe5a00] nal_unit_type: 19(IDR_W_RADL), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fe5a00] Decoding SEI [hevc @ 0x1fe5a00] Skipped PREFIX SEI 5 [hevc @ 0x1fe5a00] Decoding SEI [hevc @ 0x1fe5a00] Skipped PREFIX SEI 6 cur_dts is invalid (this is harmless if it occurs once at the start per stream) [hevc @ 0x1ffeba0] nal_unit_type: 1(TRAIL_R), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x200c4d0] nal_unit_type: 1(TRAIL_R), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x200c4d0] Output frame with POC 0. [hevc @ 0x1fe5a00] Decoded frame with POC 0. cur_dts is invalid (this is harmless if it occurs once at the start per stream) [hevc @ 0x1fe5a00] nal_unit_type: 0(TRAIL_N), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1fe5a00] Output frame with POC 1. [hevc @ 0x1ffeba0] Decoded frame with POC 5. cur_dts is invalid (this is harmless if it occurs once at the start per stream) [hevc @ 0x1ffeba0] nal_unit_type: 0(TRAIL_N), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1ffeba0] Output frame with POC 2. [hevc @ 0x200c4d0] Decoded frame with POC 3. 
[Parsed_fps_0 @ 0x201a6c0] Setting 'fps' to value '1' [Parsed_fps_0 @ 0x201a6c0] fps=1/1 [graph 0 input from stream 0:0 @ 0x201abb0] Setting 'video_size' to value '1920x800' [graph 0 input from stream 0:0 @ 0x201abb0] Setting 'pix_fmt' to value '0' [graph 0 input from stream 0:0 @ 0x201abb0] Setting 'time_base' to value '1/24000' [graph 0 input from stream 0:0 @ 0x201abb0] Setting 'pixel_aspect' to value '0/1' [graph 0 input from stream 0:0 @ 0x201abb0] Setting 'sws_param' to value 'flags=2' [graph 0 input from stream 0:0 @ 0x201abb0] Setting 'frame_rate' to value '24/1' [graph 0 input from stream 0:0 @ 0x201abb0] w:1920 h:800 pixfmt:yuv420p tb:1/24000 fr:24/1 sar:0/1 sws_param:flags=2 [format @ 0x201aad0] compat: called with args=[yuvj420p|yuvj422p|yuvj444p] [format @ 0x201aad0] Setting 'pix_fmts' to value 'yuvj420p|yuvj422p|yuvj444p' [auto_scaler_0 @ 0x201a350] Setting 'flags' to value 'bicubic' [auto_scaler_0 @ 0x201a350] w:iw h:ih flags:'bicubic' interl:0 [format @ 0x201aad0] auto-inserting filter 'auto_scaler_0' between the filter 'Parsed_fps_0' and the filter 'format' [AVFilterGraph @ 0x201a2f0] query_formats: 4 queried, 2 merged, 1 already done, 0 delayed [auto_scaler_0 @ 0x201a350] picking yuvj420p out of 3 ref:yuv420p alpha:0 [swscaler @ 0x21d7da0] deprecated pixel format used, make sure you did set range correctly [auto_scaler_0 @ 0x201a350] w:1920 h:800 fmt:yuv420p sar:0/1 -> w:1920 h:800 fmt:yuvj420p sar:0/1 flags:0x4 [mjpeg @ 0x1fe2ba0] Forcing thread count to 1 for MJPEG encoding, use -thread_type slice or a constant quantizer if you want to use multiple cpu cores [mjpeg @ 0x1fe2ba0] intra_quant_bias = 96 inter_quant_bias = 0 Output #0, image2, to './out_img/ffmpeg-msa_x265_%d.jpg': Metadata: major_brand : iso4 minor_version : 1 compatible_brands: iso4hvc1 encoder : Lavf57.71.100 Stream #0:0(und), 0, 1/1: Video: mjpeg, yuvj420p(pc), 1920x800, q=2-31, 200 kb/s, 1 fps, 1 tbn, 1 tbc (default) Metadata: creation_time : 2014-08-25T18:10:46.000000Z 
handler_name : hevc:fps=24@GPAC0.5.1-DEV-rev4807 encoder : Lavc57.89.100 mjpeg Side data: cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: -1 cur_dts is invalid (this is harmless if it occurs once at the start per stream) ... No more output streams to write to, finishing. frame= 15 fps=0.7 q=24.8 Lsize=N/A time=00:00:15.00 bitrate=N/A speed=0.668x video:1084kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown Input file #0 (./Tears_400_x265.mp4): Input stream #0:0 (video): 335 packets read (701773 bytes); 335 frames decoded; Total: 335 packets (701773 bytes) demuxed Output file #0 (./out_img/ffmpeg-msa_x265_%d.jpg): Output stream #0:0 (video): 15 frames encoded; 15 packets muxed (1109604 bytes); Total: 15 packets (1109604 bytes) muxed bench: utime=22.300s 335 frames successfully decoded, 0 decoding errors bench: maxrss=72432kB [Parsed_fps_0 @ 0x201a6c0] 335 frames in, 15 frames out; 320 frames dropped, 0 frames duplicated. [AVIOContext @ 0x1fd6220] Statistics: 734659 bytes read, 2 seeks

x264 without MIPS SIMD (log, abridged)

 ffmpeg started on 2010-10-18 at 00:28:31 Report written to "ffmpeg-20101018-002831.log" Command line: ./ffmpeg-mips/ffmpeg-soft/bin/ffmpeg -i "./The Simpsons Movie - Trailer_x264.mp4" -vf "fps=1/10" "./out_img/ffmpeg-soft_x264_%d.jpg" -report -benchmark ffmpeg version 3.3 Copyright (c) 2000-2017 the FFmpeg developers built with gcc 4.9.2 (Codescape GNU Tools 2016.05-03 for MIPS MTI Linux) configuration: --enable-cross-compile --prefix=../ffmpeg-soft --cross-prefix=/home/stas/mipsfpga/toolchain/mips-mti-linux-gnu/2016.05-03/bin/mips-mti-linux-gnu- --arch=mips --cpu=p5600 --target-os=linux --extra-cflags='-EL -static' --extra-ldflags='-EL -static' --disable-iconv --disable-msa libavutil 55. 58.100 / 55. 58.100 libavcodec 57. 89.100 / 57. 89.100 libavformat 57. 71.100 / 57. 71.100 libavdevice 57. 6.100 / 57. 6.100 libavfilter 6. 82.100 / 6. 82.100 libswscale 4. 6.100 / 4. 6.100 libswresample 2. 7.100 / 2. 7.100 Splitting the commandline. Reading option '-i' ... matched as input url with argument './The Simpsons Movie - Trailer_x264.mp4'. Reading option '-vf' ... matched as option 'vf' (set video filters) with argument 'fps=1/10'. Reading option './out_img/ffmpeg-soft_x264_%d.jpg' ... matched as output url. Reading option '-report' ... matched as option 'report' (generate a report) with argument '1'. Reading option '-benchmark' ... matched as option 'benchmark' (add timings for benchmarking) with argument '1'. Finished splitting the commandline. Parsing a group of options: global . Applying option report (generate a report) with argument 1. Applying option benchmark (add timings for benchmarking) with argument 1. Successfully parsed a group of options. Parsing a group of options: input url ./The Simpsons Movie - Trailer_x264.mp4. Successfully parsed a group of options. Opening an input file: ./The Simpsons Movie - Trailer_x264.mp4. 
[file @ 0x1f4a0e0] Setting default whitelist 'file,crypto' [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] Format mov,mp4,m4a,3gp,3g2,mj2 probed with size=2048 and score=100 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] ISO: File Type Major Brand: isom [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] Unknown dref type 0x206c7275 size 12 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] Unknown dref type 0x206c7275 size 12 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] Before avformat_find_stream_info() pos: 73516232 bytes read:65587 seeks:1 nb_streams:2 [h264 @ 0x1f4acb0] nal_unit_type: 7, nal_ref_idc: 3 [h264 @ 0x1f4acb0] nal_unit_type: 8, nal_ref_idc: 3 [h264 @ 0x1f4acb0] nal_unit_type: 6, nal_ref_idc: 0 [h264 @ 0x1f4acb0] nal_unit_type: 5, nal_ref_idc: 3 [h264 @ 0x1f4acb0] user data:"x264 - core 54 svn-620M - H.264/MPEG-4 AVC codec - Copyleft 2005 - http://www.videolan.org/x264.html - options: cabac=1 ref=5 deblock=1:0:0 analyse=0x1:0x131 me=umh subme=6 brdo=1 mixed_ref=0 me_range=16 chroma_me=1 trellis=1 8x8dct=0 cqm=0 deadzone=21,11 chroma_qp_offset=0 threads=1 nr=0 decimate=1 mbaff=0 bframes=1 b_pyramid=0 b_adapt=1 b_bias=0 direct=3 wpredb=0 bime=0 keyint=250 keyint_min=25 scenecut=40 rc=2pass bitrate=4214 ratetol=1.0 rceq='blurCplx^(1-qComp)' qcomp=0.60 qpmin=10 qpmax=51 qpstep=4 cplxblur=20.0 qblur=0.5 ip_ratio=1.40 pb_ratio=1.30" [h264 @ 0x1f4acb0] Reinit context to 1280x544, pix_fmt: yuv420p [h264 @ 0x1f4acb0] no picture [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] All info found [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] After avformat_find_stream_info() pos: 94845 bytes read:141348 seeks:2 frames:13 Input #0, mov,mp4,m4a,3gp,3g2,mj2, from './The Simpsons Movie - Trailer_x264.mp4': Metadata: major_brand : isom minor_version : 1 compatible_brands: isomavc1 creation_time : 2007-02-19T05:03:04.000000Z Duration: 00:02:17.30, start: 0.000000, bitrate: 4283 kb/s Stream #0:0(und), 12, 1/24000: Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x544, 4221 kb/s, 23.98 fps, 23.98 tbr, 24k tbn, 47.95 tbc (default) Metadata: 
creation_time : 2007-02-19T05:03:04.000000Z handler_name : GPAC ISO Video Handler Stream #0:1(und), 1, 1/48000: Audio: aac (HE-AAC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 64 kb/s (default) Metadata: creation_time : 2007-02-19T05:03:08.000000Z handler_name : GPAC ISO Audio Handler Successfully opened the file. Parsing a group of options: output url ./out_img/ffmpeg-soft_x264_%d.jpg. Applying option vf (set video filters) with argument fps=1/10. Successfully parsed a group of options. Opening an output file: ./out_img/ffmpeg-soft_x264_%d.jpg. Successfully opened the file. detected 2 logical cores [h264 @ 0x1f951e0] nal_unit_type: 7, nal_ref_idc: 3 [h264 @ 0x1f951e0] nal_unit_type: 8, nal_ref_idc: 3 Stream mapping: Stream #0:0 -> #0:0 (h264 (native) -> mjpeg (native)) Press [q] to stop, [?] for help cur_dts is invalid (this is harmless if it occurs once at the start per stream) cur_dts is invalid (this is harmless if it occurs once at the start per stream) [h264 @ 0x1f951e0] nal_unit_type: 6, nal_ref_idc: 0 [h264 @ 0x1f951e0] nal_unit_type: 5, nal_ref_idc: 3 [h264 @ 0x1f951e0] user data:"x264 - core 54 svn-620M - H.264/MPEG-4 AVC codec - Copyleft 2005 - http://www.videolan.org/x264.html - options: cabac=1 ref=5 deblock=1:0:0 analyse=0x1:0x131 me=umh subme=6 brdo=1 mixed_ref=0 me_range=16 chroma_me=1 trellis=1 8x8dct=0 cqm=0 deadzone=21,11 chroma_qp_offset=0 threads=1 nr=0 decimate=1 mbaff=0 bframes=1 b_pyramid=0 b_adapt=1 b_bias=0 direct=3 wpredb=0 bime=0 keyint=250 keyint_min=25 scenecut=40 rc=2pass bitrate=4214 ratetol=1.0 rceq='blurCplx^(1-qComp)' qcomp=0.60 qpmin=10 qpmax=51 qpstep=4 cplxblur=20.0 qblur=0.5 ip_ratio=1.40 pb_ratio=1.30" [h264 @ 0x1f951e0] Reinit context to 1280x544, pix_fmt: yuv420p [h264 @ 0x1f951e0] no picture cur_dts is invalid (this is harmless if it occurs once at the start per stream) [h264 @ 0x1fa2050] nal_unit_type: 1, nal_ref_idc: 2 [h264 @ 0x1fdcf00] nal_unit_type: 1, nal_ref_idc: 0 cur_dts is invalid (this is harmless if it occurs 
once at the start per stream) [h264 @ 0x1f951e0] nal_unit_type: 1, nal_ref_idc: 2 [Parsed_fps_0 @ 0x1fa7f50] Setting 'fps' to value '1/10' [Parsed_fps_0 @ 0x1fa7f50] fps=1/10 [graph 0 input from stream 0:0 @ 0x1fa86f0] Setting 'video_size' to value '1280x544' [graph 0 input from stream 0:0 @ 0x1fa86f0] Setting 'pix_fmt' to value '0' [graph 0 input from stream 0:0 @ 0x1fa86f0] Setting 'time_base' to value '1/24000' [graph 0 input from stream 0:0 @ 0x1fa86f0] Setting 'pixel_aspect' to value '0/1' [graph 0 input from stream 0:0 @ 0x1fa86f0] Setting 'sws_param' to value 'flags=2' [graph 0 input from stream 0:0 @ 0x1fa86f0] Setting 'frame_rate' to value '24000/1001' [graph 0 input from stream 0:0 @ 0x1fa86f0] w:1280 h:544 pixfmt:yuv420p tb:1/24000 fr:24000/1001 sar:0/1 sws_param:flags=2 [format @ 0x1fa8030] compat: called with args=[yuvj420p|yuvj422p|yuvj444p] [format @ 0x1fa8030] Setting 'pix_fmts' to value 'yuvj420p|yuvj422p|yuvj444p' [auto_scaler_0 @ 0x1fa8660] Setting 'flags' to value 'bicubic' [auto_scaler_0 @ 0x1fa8660] w:iw h:ih flags:'bicubic' interl:0 [format @ 0x1fa8030] auto-inserting filter 'auto_scaler_0' between the filter 'Parsed_fps_0' and the filter 'format' [AVFilterGraph @ 0x1fa7b80] query_formats: 4 queried, 2 merged, 1 already done, 0 delayed [auto_scaler_0 @ 0x1fa8660] picking yuvj420p out of 3 ref:yuv420p alpha:0 [swscaler @ 0x1fa8f00] deprecated pixel format used, make sure you did set range correctly [auto_scaler_0 @ 0x1fa8660] w:1280 h:544 fmt:yuv420p sar:0/1 -> w:1280 h:544 fmt:yuvj420p sar:0/1 flags:0x4 [mjpeg @ 0x1f71f90] Forcing thread count to 1 for MJPEG encoding, use -thread_type slice or a constant quantizer if you want to use multiple cpu cores [mjpeg @ 0x1f71f90] intra_quant_bias = 96 inter_quant_bias = 0 Output #0, image2, to './out_img/ffmpeg-soft_x264_%d.jpg': Metadata: major_brand : isom minor_version : 1 compatible_brands: isomavc1 encoder : Lavf57.71.100 Stream #0:0(und), 0, 10/1: Video: mjpeg, yuvj420p(pc), 1280x544, q=2-31, 
200 kb/s, 0.10 fps, 0.10 tbn, 0.10 tbc (default) Metadata: creation_time : 2007-02-19T05:03:04.000000Z handler_name : GPAC ISO Video Handler encoder : Lavc57.89.100 mjpeg Side data: cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: -1 cur_dts is invalid (this is harmless if it occurs once at the start per stream) [Parsed_fps_0 @ 0x1fa7f50] Dropping 1 frame(s). [h264 @ 0x1fa2050] nal_unit_type: 1, nal_ref_idc: 0 ... [AVIOContext @ 0x2229af0] Statistics: 0 seeks, 1 writeouts No more output streams to write to, finishing. frame= 15 fps=0.1 q=1.6 Lsize=N/A time=00:02:30.00 bitrate=N/A speed=1.45x video:1382kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown Input file #0 (./The Simpsons Movie - Trailer_x264.mp4): Input stream #0:0 (video): 3288 packets read (72364468 bytes); 3288 frames decoded; Input stream #0:1 (audio): 1 packets read (134 bytes); Total: 3289 packets (72364602 bytes) demuxed Output file #0 (./out_img/ffmpeg-soft_x264_%d.jpg): Output stream #0:0 (video): 15 frames encoded; 15 packets muxed (1414925 bytes); Total: 15 packets (1414925 bytes) muxed bench: utime=164.240s 3288 frames successfully decoded, 0 decoding errors bench: maxrss=39936kB [Parsed_fps_0 @ 0x1fa7f50] 3288 frames in, 15 frames out; 3273 frames dropped, 0 frames duplicated. [AVIOContext @ 0x1f52230] Statistics: 73517562 bytes read, 5 seeks
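For the comparison to be fair, the two ffmpeg builds should differ only in MSA support. The sketch below compares the two `configuration:` strings copied from the logs. The variable names and the `flags` helper are illustrative.

```python
import shlex

# "configuration:" strings copied from the logs of the two builds.
MSA_CONFIG = (
    "--enable-cross-compile --prefix=../ffmpeg-msa "
    "--cross-prefix=/home/stas/mipsfpga/toolchain/mips-mti-linux-gnu/"
    "2016.05-03/bin/mips-mti-linux-gnu- --arch=mips --cpu=p5600 "
    "--target-os=linux --extra-cflags='-EL -static' "
    "--extra-ldflags='-EL -static' --disable-iconv"
)
SOFT_CONFIG = (
    "--enable-cross-compile --prefix=../ffmpeg-soft "
    "--cross-prefix=/home/stas/mipsfpga/toolchain/mips-mti-linux-gnu/"
    "2016.05-03/bin/mips-mti-linux-gnu- --arch=mips --cpu=p5600 "
    "--target-os=linux --extra-cflags='-EL -static' "
    "--extra-ldflags='-EL -static' --disable-iconv --disable-msa"
)


def flags(config):
    """Split a configure line into a set of flags, honouring shell quotes."""
    return set(shlex.split(config))


only_msa = flags(MSA_CONFIG) - flags(SOFT_CONFIG)
only_soft = flags(SOFT_CONFIG) - flags(MSA_CONFIG)

if __name__ == "__main__":
    print("only in MSA build: ", sorted(only_msa))
    print("only in soft build:", sorted(only_soft))
```

Apart from the install prefix, the only difference is `--disable-msa`, so the timing gap can be attributed to the MSA-optimized code paths.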

x265 without MIPS SIMD (log, abridged)

 ffmpeg started on 2010-10-18 at 00:27:14 Report written to "ffmpeg-20101018-002714.log" Command line: ./ffmpeg-mips/ffmpeg-soft/bin/ffmpeg -i ./Tears_400_x265.mp4 -vf "fps=1" "./out_img/ffmpeg-soft_x265_%d.jpg" -report -benchmark ffmpeg version 3.3 Copyright (c) 2000-2017 the FFmpeg developers built with gcc 4.9.2 (Codescape GNU Tools 2016.05-03 for MIPS MTI Linux) configuration: --enable-cross-compile --prefix=../ffmpeg-soft --cross-prefix=/home/stas/mipsfpga/toolchain/mips-mti-linux-gnu/2016.05-03/bin/mips-mti-linux-gnu- --arch=mips --cpu=p5600 --target-os=linux --extra-cflags='-EL -static' --extra-ldflags='-EL -static' --disable-iconv --disable-msa libavutil 55. 58.100 / 55. 58.100 libavcodec 57. 89.100 / 57. 89.100 libavformat 57. 71.100 / 57. 71.100 libavdevice 57. 6.100 / 57. 6.100 libavfilter 6. 82.100 / 6. 82.100 libswscale 4. 6.100 / 4. 6.100 libswresample 2. 7.100 / 2. 7.100 Splitting the commandline. Reading option '-i' ... matched as input url with argument './Tears_400_x265.mp4'. Reading option '-vf' ... matched as option 'vf' (set video filters) with argument 'fps=1'. Reading option './out_img/ffmpeg-soft_x265_%d.jpg' ... matched as output url. Reading option '-report' ... matched as option 'report' (generate a report) with argument '1'. Reading option '-benchmark' ... matched as option 'benchmark' (add timings for benchmarking) with argument '1'. Finished splitting the commandline. Parsing a group of options: global . Applying option report (generate a report) with argument 1. Applying option benchmark (add timings for benchmarking) with argument 1. Successfully parsed a group of options. Parsing a group of options: input url ./Tears_400_x265.mp4. Successfully parsed a group of options. Opening an input file: ./Tears_400_x265.mp4. 
[file @ 0x1f4a0e0] Setting default whitelist 'file,crypto' [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] Format mov,mp4,m4a,3gp,3g2,mj2 probed with size=2048 and score=100 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] ISO: File Type Major Brand: iso4 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] Unknown dref type 0x206c7275 size 12 [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] Before avformat_find_stream_info() pos: 705972 bytes read:32827 seeks:1 nb_streams:1 [hevc @ 0x1f4aca0] nal_unit_type: 32(VPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f4aca0] Decoding VPS [hevc @ 0x1f4aca0] Main profile bitstream [hevc @ 0x1f4aca0] nal_unit_type: 33(SPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f4aca0] Decoding SPS [hevc @ 0x1f4aca0] Main profile bitstream [hevc @ 0x1f4aca0] Decoding VUI [hevc @ 0x1f4aca0] nal_unit_type: 34(PPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f4aca0] Decoding PPS [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] All info found [mov,mp4,m4a,3gp,3g2,mj2 @ 0x1f49980] After avformat_find_stream_info() pos: 20299 bytes read:65595 seeks:2 frames:1 Input #0, mov,mp4,m4a,3gp,3g2,mj2, from './Tears_400_x265.mp4': Metadata: major_brand : iso4 minor_version : 1 compatible_brands: iso4hvc1 creation_time : 2014-08-25T18:10:46.000000Z Duration: 00:00:13.96, start: 0.125000, bitrate: 404 kb/s Stream #0:0(und), 1, 1/24000: Video: hevc (Main) (hvc1 / 0x31637668), yuv420p(tv), 1920x800, 402 kb/s, 24 fps, 24 tbr, 24k tbn, 24 tbc (default) Metadata: creation_time : 2014-08-25T18:10:46.000000Z handler_name : hevc:fps=24@GPAC0.5.1-DEV-rev4807 Successfully opened the file. Parsing a group of options: output url ./out_img/ffmpeg-soft_x265_%d.jpg. Applying option vf (set video filters) with argument fps=1. Successfully parsed a group of options. Opening an output file: ./out_img/ffmpeg-soft_x265_%d.jpg. Successfully opened the file. 
detected 2 logical cores [hevc @ 0x1f61a00] nal_unit_type: 32(VPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f61a00] Decoding VPS [hevc @ 0x1f61a00] Main profile bitstream [hevc @ 0x1f61a00] nal_unit_type: 33(SPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f61a00] Decoding SPS [hevc @ 0x1f61a00] Main profile bitstream [hevc @ 0x1f61a00] Decoding VUI [hevc @ 0x1f61a00] nal_unit_type: 34(PPS), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f61a00] Decoding PPS Stream mapping: Stream #0:0 -> #0:0 (hevc (native) -> mjpeg (native)) Press [q] to stop, [?] for help cur_dts is invalid (this is harmless if it occurs once at the start per stream) cur_dts is invalid (this is harmless if it occurs once at the start per stream) [hevc @ 0x1f61a00] nal_unit_type: 39(SEI_PREFIX), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f61a00] nal_unit_type: 39(SEI_PREFIX), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f61a00] nal_unit_type: 19(IDR_W_RADL), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f61a00] Decoding SEI [hevc @ 0x1f61a00] Skipped PREFIX SEI 5 [hevc @ 0x1f61a00] Decoding SEI [hevc @ 0x1f61a00] Skipped PREFIX SEI 6 cur_dts is invalid (this is harmless if it occurs once at the start per stream) [hevc @ 0x1f7aba0] nal_unit_type: 1(TRAIL_R), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f884d0] nal_unit_type: 1(TRAIL_R), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f884d0] Output frame with POC 0. [hevc @ 0x1f61a00] Decoded frame with POC 0. cur_dts is invalid (this is harmless if it occurs once at the start per stream) [hevc @ 0x1f61a00] nal_unit_type: 0(TRAIL_N), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f61a00] Output frame with POC 1. [hevc @ 0x1f7aba0] Decoded frame with POC 5. cur_dts is invalid (this is harmless if it occurs once at the start per stream) [hevc @ 0x1f7aba0] nal_unit_type: 0(TRAIL_N), nuh_layer_id: 0, temporal_id: 0 [hevc @ 0x1f7aba0] Output frame with POC 2. [hevc @ 0x1f884d0] Decoded frame with POC 3. 
[Parsed_fps_0 @ 0x1f966c0] Setting 'fps' to value '1'
[Parsed_fps_0 @ 0x1f966c0] fps=1/1
[graph 0 input from stream 0:0 @ 0x1f96bb0] Setting 'video_size' to value '1920x800'
[graph 0 input from stream 0:0 @ 0x1f96bb0] Setting 'pix_fmt' to value '0'
[graph 0 input from stream 0:0 @ 0x1f96bb0] Setting 'time_base' to value '1/24000'
[graph 0 input from stream 0:0 @ 0x1f96bb0] Setting 'pixel_aspect' to value '0/1'
[graph 0 input from stream 0:0 @ 0x1f96bb0] Setting 'sws_param' to value 'flags=2'
[graph 0 input from stream 0:0 @ 0x1f96bb0] Setting 'frame_rate' to value '24/1'
[graph 0 input from stream 0:0 @ 0x1f96bb0] w:1920 h:800 pixfmt:yuv420p tb:1/24000 fr:24/1 sar:0/1 sws_param:flags=2
[format @ 0x1f96ad0] compat: called with args=[yuvj420p|yuvj422p|yuvj444p]
[format @ 0x1f96ad0] Setting 'pix_fmts' to value 'yuvj420p|yuvj422p|yuvj444p'
[hevc @ 0x1f61a00] Decoded frame with POC 1.
[hevc @ 0x1f7aba0] Decoded frame with POC 2.
[auto_scaler_0 @ 0x1f96350] Setting 'flags' to value 'bicubic'
[auto_scaler_0 @ 0x1f96350] w:iw h:ih flags:'bicubic' interl:0
[format @ 0x1f96ad0] auto-inserting filter 'auto_scaler_0' between the filter 'Parsed_fps_0' and the filter 'format'
[AVFilterGraph @ 0x1f962f0] query_formats: 4 queried, 2 merged, 1 already done, 0 delayed
[auto_scaler_0 @ 0x1f96350] picking yuvj420p out of 3 ref:yuv420p alpha:0
[swscaler @ 0x2153da0] deprecated pixel format used, make sure you did set range correctly
[auto_scaler_0 @ 0x1f96350] w:1920 h:800 fmt:yuv420p sar:0/1 -> w:1920 h:800 fmt:yuvj420p sar:0/1 flags:0x4
[mjpeg @ 0x1f5eba0] Forcing thread count to 1 for MJPEG encoding, use -thread_type slice or a constant quantizer if you want to use multiple cpu cores
[mjpeg @ 0x1f5eba0] intra_quant_bias = 96 inter_quant_bias = 0
Output #0, image2, to './out_img/ffmpeg-soft_x265_%d.jpg':
  Metadata:
    major_brand : iso4
    minor_version : 1
    compatible_brands: iso4hvc1
    encoder : Lavf57.71.100
    Stream #0:0(und), 0, 1/1: Video: mjpeg, yuvj420p(pc), 1920x800, q=2-31, 200 kb/s, 1 fps, 1 tbn, 1 tbc (default)
    Metadata:
      creation_time : 2014-08-25T18:10:46.000000Z
      handler_name : hevc:fps=24@GPAC0.5.1-DEV-rev4807
      encoder : Lavc57.89.100 mjpeg
    Side data:
      cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: -1
cur_dts is invalid (this is harmless if it occurs once at the start per stream)
[hevc @ 0x1f884d0] nal_unit_type: 0(TRAIL_N), nuh_layer_id: 0, temporal_id: 0
[hevc @ 0x1f884d0] Output frame with POC 3.
[Parsed_fps_0 @ 0x1f966c0] Dropping 1 frame(s).
frame= 0 fps=0.0 q=0.0 size=N/A time=00:00:00.00 bitrate=N/A speed= 0x
cur_dts is invalid (this is harmless if it occurs once at the start per stream)
[Parsed_fps_0 @ 0x1f966c0] Dropping 1 frame(s).
[hevc @ 0x1f61a00] nal_unit_type: 1(TRAIL_R), nuh_layer_id: 0, temporal_id: 0
cur_dts is invalid (this is harmless if it occurs once at the start per stream)
[hevc @ 0x1f61a00] Output frame with POC 4.
...
No more output streams to write to, finishing.
frame= 15 fps=0.5 q=24.8 Lsize=N/A time=00:00:15.00 bitrate=N/A speed=0.451x
video:1084kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Input file #0 (./Tears_400_x265.mp4):
  Input stream #0:0 (video): 335 packets read (701773 bytes); 335 frames decoded;
  Total: 335 packets (701773 bytes) demuxed
Output file #0 (./out_img/ffmpeg-soft_x265_%d.jpg):
  Output stream #0:0 (video): 15 frames encoded; 15 packets muxed (1109604 bytes);
  Total: 15 packets (1109604 bytes) muxed
bench: utime=52.330s
335 frames successfully decoded, 0 decoding errors
bench: maxrss=72480kB
[Parsed_fps_0 @ 0x1f966c0] 335 frames in, 15 frames out; 320 frames dropped, 0 frames duplicated.
[AVIOContext @ 0x1f52220] Statistics: 734659 bytes read, 2 seeks

Conclusions

MIPS SIMD ;
— . - MIPS SIMD , . ;
-1 — , , , .

Links

[L1] — Baikal-T1 processor ;
[L2] — MIPSfpga-plus project on GitHub ;
[L3] — P-Class P5600 Multiprocessor Core ;
[L4] — MIPS, ;
[L5] — Texas Instruments. Digital Signal Processors ;
[L6] — GPU ;
[L7] — OpenCL. ;
[L8] — Wikipedia: ;
[L9] — TFilter. Free online FIR filter design tool ;
[L10] — Wikipedia: ;
[L11] — Wikipedia: - ;
[L12] — Wikipedia: - ;
[L13] — Wikipedia: SIMD ;
[L14] — Wikipedia: ;
[L15] — MIPS SIMD ;
[L16] — GCC: MIPS SIMD Architecture (MSA) Support ;
[L17] — GCC: MIPS SIMD Architecture Built-in Functions ;
[L18] — ffmpeg github ( libavcodec/mips/) ;
[L19] — FFmpeg multimedia framework ;
[L20] — Codescape MIPS SDK ;
[L21] — H.264 Demo Clips ;
[L22] — x265. Sample HEVC Video Files ;
[L23] — ffmpeg
[L24] — ;

Documentation

[D1] — ., . — ;
[D2] — MIPS Architecture for Programmers Volume IV-j: The MIPS32 SIMD Architecture Module ;
[D3] — MIPS SIMD programming. Optimizing multimedia codecs ;

[P1] — - -1 . (: L1 );
[P2] — TFilter. 1 ();
[P3] — TFilter. 1 ;
[P4] — TFilter. 2 ();
[P5] — TFilter. 2 ;
[P6] — SIMD- (: D3 );
[P7] — MSA Vector registers (source: D2);
[P8] — MADDV Operation description (source: D2);

Source: https://habr.com/ru/post/328566/
