The source code for this article and MK.js are on my GitHub. I have not published the training data set, but you can build your own and train the model as described below!
A few years ago I released an experiment built on the getUserMedia API. In it, the user's camera served as a controller for playing a small JavaScript clone of Mortal Kombat 3. You can find that game in the GitHub repository. As part of the experiment, I implemented a basic positioning algorithm that classifies an image into one of several body-position classes. The algorithm photographs the background. As soon as the user appears in the frame, it calculates the difference between the background and the current frame containing the user; in this way it locates the user's figure. The next step is to render the user's body in white on black. After that, vertical and horizontal histograms are built, summing the values for each pixel. Based on this calculation, the algorithm determines the current body position.
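The article does not include that old implementation, so here is only a rough TypeScript sketch of the idea — the difference threshold, names, and types are my assumptions, not the original code:

```ts
// Binarize the current frame against a stored background frame:
// 1 = pixel belongs to the user, 0 = pixel belongs to the background.
const diffThreshold = 50; // assumed per-pixel difference threshold

const binarize = (bg: ImageData, frame: ImageData): Uint8Array => {
  const mask = new Uint8Array(frame.width * frame.height);
  for (let i = 0; i < mask.length; i++) {
    const d =
      Math.abs(frame.data[i * 4] - bg.data[i * 4]) +
      Math.abs(frame.data[i * 4 + 1] - bg.data[i * 4 + 1]) +
      Math.abs(frame.data[i * 4 + 2] - bg.data[i * 4 + 2]);
    mask[i] = d > diffThreshold ? 1 : 0;
  }
  return mask;
};

// Sum the mask per row and per column; the original algorithm classified
// the body position from the shape of these two histograms.
const histograms = (mask: Uint8Array, width: number, height: number) => {
  const horizontal = new Array(height).fill(0);
  const vertical = new Array(width).fill(0);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const v = mask[y * width + x];
      horizontal[y] += v;
      vertical[x] += v;
    }
  }
  return { horizontal, vertical };
};
```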
For the new version, I collected training data by recording short videos of each action and extracting individual jpg frames with ffmpeg:

```bash
ffmpeg -i video.mov $filename%03d.jpg
```

To run this command, you will need ffmpeg installed on your computer. The images in each category are numbered 1 to 190, so the first image is 1.jpg, the second 2.jpg, etc.

To enlarge the data set, I augmented the images with the imgaug library:

```python
import numpy as np
import imgaug as ia
from imgaug import augmenters as iaa
from scipy import misc, ndimage

np.random.seed(44)
ia.seed(44)

def main():
    for i in range(1, 191):
        draw_single_sequential_images(str(i), "others", "others-aug")
    for i in range(1, 191):
        draw_single_sequential_images(str(i), "hits", "hits-aug")
    for i in range(1, 191):
        draw_single_sequential_images(str(i), "kicks", "kicks-aug")

def draw_single_sequential_images(filename, path, aug_path):
    image = misc.imresize(ndimage.imread(path + "/" + filename + ".jpg"), (56, 100))
    sometimes = lambda aug: iaa.Sometimes(0.5, aug)
    seq = iaa.Sequential(
        [
            iaa.Fliplr(0.5),  # horizontally flip 50% of all images
            # crop images by -5% to 10% of their height/width
            sometimes(iaa.CropAndPad(
                percent=(-0.05, 0.1),
                pad_mode=ia.ALL,
                pad_cval=(0, 255)
            )),
            sometimes(iaa.Affine(
                scale={"x": (0.8, 1.2), "y": (0.8, 1.2)},  # scale images to 80-120% of their size, individually per axis
                translate_percent={"x": (-0.1, 0.1), "y": (-0.1, 0.1)},  # translate by -10 to +10 percent (per axis)
                rotate=(-5, 5),
                shear=(-5, 5),  # shear by -5 to +5 degrees
                order=[0, 1],  # use nearest neighbour or bilinear interpolation (fast)
                cval=(0, 255),  # if mode is constant, use a cval between 0 and 255
                mode=ia.ALL  # use any of scikit-image's warping modes
            )),
            iaa.Grayscale(alpha=(0.0, 1.0)),
            iaa.Invert(0.05, per_channel=False),  # invert color channels
            # execute 0 to 5 of the following (less important) augmenters per image;
            # don't execute all of them, as that would often be way too strong
            iaa.SomeOf((0, 5),
                [
                    iaa.OneOf([
                        iaa.GaussianBlur((0, 2.0)),  # blur images with a sigma between 0 and 2.0
                        iaa.AverageBlur(k=(2, 5)),  # blur image using local means with kernel sizes between 2 and 5
                        iaa.MedianBlur(k=(3, 5)),  # blur image using local medians with kernel sizes between 3 and 5
                    ]),
                    iaa.Sharpen(alpha=(0, 1.0), lightness=(0.75, 1.5)),  # sharpen images
                    iaa.Emboss(alpha=(0, 1.0), strength=(0, 2.0)),  # emboss images
                    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.01*255), per_channel=0.5),  # add gaussian noise
                    iaa.Add((-10, 10), per_channel=0.5),  # change brightness (by -10 to 10 of original value)
                    iaa.AddToHueAndSaturation((-20, 20)),  # change hue and saturation
                    # either change the brightness of the whole image (sometimes
                    # per channel) or change the brightness of subareas
                    iaa.OneOf([
                        iaa.Multiply((0.9, 1.1), per_channel=0.5),
                        iaa.FrequencyNoiseAlpha(
                            exponent=(-2, 0),
                            first=iaa.Multiply((0.9, 1.1), per_channel=True),
                            second=iaa.ContrastNormalization((0.9, 1.1))
                        )
                    ]),
                    iaa.ContrastNormalization((0.5, 2.0), per_channel=0.5),  # improve or worsen the contrast
                ],
                random_order=True
            )
        ],
        random_order=True
    )
    im = np.zeros((16, 56, 100, 3), dtype=np.uint8)
    for c in range(0, 16):
        im[c] = image
    # apply the augmentation pipeline to the 16 copies of the image
    grid = seq.augment_images(im)
    for im in range(len(grid)):
        misc.imsave(aug_path + "/" + filename + "_" + str(im) + ".jpg", grid[im])
```
In the main method there are three for loops — one for each category of images. In each iteration of each loop, we call the draw_single_sequential_images method: the first argument is the file name, the second is the source directory, and the third is the directory where the result is saved.

Note that each image is resized to 100x56 pixels. We do this to reduce the amount of data and, accordingly, the number of calculations our model performs during training and evaluation.

Instead of training a network from scratch, we take the output of MobileNet's hidden layer global_average_pooling2d_1 and pass it as input to our tiny model. Why did I choose this particular layer? Empirically. I did some tests, and this layer works quite well.

```ts
import * as tf from '@tensorflow/tfjs';

const model = tf.sequential();
model.add(tf.layers.inputLayer({ inputShape: [1024] }));
model.add(tf.layers.dense({ units: 1024, activation: 'relu' }));
model.add(tf.layers.dense({ units: 1, activation: 'sigmoid' }));
model.compile({
  optimizer: tf.train.adam(1e-6),
  loss: tf.losses.sigmoidCrossEntropy,
  metrics: ['accuracy'],
});
```
This model has one hidden layer with 1024 units and ReLU activation, plus one output unit passed through a sigmoid activation. The sigmoid produces a number from 0 to 1 — the probability that the given frame contains a punch.

Why 1024 units for the hidden layer and a learning rate of 1e-6? Well, I tried several different options and saw that these parameters work best. Trial and error may not seem like the best approach, but to a large extent that is exactly how hyperparameter tuning in deep learning works: based on our understanding of the model, we use intuition to update orthogonal parameters and empirically check how the model performs.

The compile method compiles the layers together, preparing the model for training and evaluation. Here we declare that we want to use the adam optimization algorithm. We also declare that the loss will be computed from cross entropy, and indicate that we want to evaluate the model's accuracy. TensorFlow.js then calculates accuracy by the formula:

Accuracy = (True Positives + True Negatives) / (Positives + Negatives)
To run MobileNet outside the browser, we need its files. The model.json file contains the model architecture — layers, activations, etc. The remaining files contain the parameters of the model. You can load the model from a file using this code:

```ts
export const loadModel = async () => {
  const mn = new mobilenet.MobileNet(1, 1);
  mn.path = `file://PATH/TO/model.json`;
  await mn.load();
  return (input): tf.Tensor1D =>
    mn.infer(input, 'global_average_pooling2d_1')
      .reshape([1024]);
};
```
In loadModel we return a function that takes a one-dimensional tensor as input and returns mn.infer(input, Layer). The infer method takes a tensor and a layer as arguments; the layer determines which hidden layer we want to get the output from. If you open model.json and search for global_average_pooling2d_1, you will find it among the layer names.
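If you would rather enumerate the layers programmatically than search model.json by hand, a small sketch along these lines works — the path is a placeholder, and loading a file:// URL in Node assumes a file system handler such as the one provided by tfjs-node:

```ts
import * as tf from '@tensorflow/tfjs';

// Print every layer name of the converted MobileNet so that
// 'global_average_pooling2d_1' can be located by eye.
const listLayerNames = async () => {
  const model = await tf.loadModel(`file://PATH/TO/model.json`);
  model.layers.forEach((layer, index) => console.log(index, layer.name));
};
```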
To build the training data set, we run each image through the infer method of MobileNet and assign it a label: 1 for images with punches and 0 for images without punches:

```ts
const punches = require('fs')
  .readdirSync(Punches)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Punches}/${f}`);

const others = require('fs')
  .readdirSync(Others)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Others}/${f}`);

const ys = tf.tensor1d(
  new Array(punches.length).fill(1)
    .concat(new Array(others.length).fill(0)));

const xs: tf.Tensor2D = tf.stack(
  punches
    .map((path: string) => mobileNet(readInput(path)))
    .concat(others.map((path: string) => mobileNet(readInput(path))))
) as tf.Tensor2D;
```
If we have n images with punches and m other images, the label tensor will have n elements with a value of 1 followed by m elements with a value of 0.

In xs we stack the results of calling the infer method for the individual images. Notice that for each image we call the readInput method. Here is its implementation:

```ts
export const readInput = img => imageToInput(readImage(img), TotalChannels);

const readImage = path => jpeg.decode(fs.readFileSync(path), true);

const imageToInput = (image, numChannels) => {
  const values = serializeImage(image);
  return tf.tensor3d(values, [image.height, image.width, numChannels], 'int32');
};

const serializeImage = image => {
  const totalPixels = image.width * image.height;
  const result = new Int32Array(totalPixels * 3);
  for (let i = 0; i < totalPixels; i++) {
    result[i * 3 + 0] = image.data[i * 4 + 0];
    result[i * 3 + 1] = image.data[i * 4 + 1];
    result[i * 3 + 2] = image.data[i * 4 + 2];
  }
  return result;
};
```
readInput first calls the readImage function and then delegates to imageToInput. readImage reads the image from disk and decodes the jpg from the buffer using the jpeg-js package. In imageToInput we transform the image into a three-dimensional tensor.

In the end, for each i from 0 to TotalImages, ys[i] should equal 1 if xs[i] corresponds to an image with a punch, and 0 otherwise.

We train the model by calling the fit method:

```ts
await model.fit(xs, ys, {
  epochs: Epochs,
  batchSize: parseInt(((punches.length + others.length) * BatchSize).toFixed(0)),
  callbacks: {
    onBatchEnd: async (_, logs) => {
      console.log('Cost: %s, accuracy: %s', logs.loss.toFixed(5), logs.acc.toFixed(5));
      await tf.nextFrame();
    }
  }
});
```
We call fit with three arguments: xs, ys, and a configuration object. In the configuration object we set how many epochs the model will train for, the batch size, and a callback that TensorFlow.js invokes after processing each batch.

The batch size determines how large a subset of xs and ys the model trains on in one step within an epoch. For each epoch, TensorFlow.js will select a subset of xs and the corresponding elements from ys, perform forward propagation, obtain the output of the layer with sigmoid activation, and then, based on the loss, perform optimization using the adam algorithm.

Once training starts, you will see output like this:

```
Cost: 0.84212, accuracy: 1.00000
eta = 0.3 > ---------- acc = 1.00 loss = 0.84
Cost: 0.79740, accuracy: 1.00000
eta = 0.2 => --------- acc = 1.00 loss = 0.80
Cost: 0.81533, accuracy: 1.00000
eta = 0.2 ==> -------- acc = 1.00 loss = 0.82
Cost: 0.64303, accuracy: 0.50000
eta = 0.2 ===> ------- acc = 0.50 loss = 0.64
Cost: 0.51377, accuracy: 0.00000
eta = 0.2 ====> ------ acc = 0.00 loss = 0.51
Cost: 0.46473, accuracy: 0.50000
eta = 0.1 =====> ----- acc = 0.50 loss = 0.46
Cost: 0.50872, accuracy: 0.00000
eta = 0.1 ======> ---- acc = 0.00 loss = 0.51
Cost: 0.62556, accuracy: 1.00000
eta = 0.1 =======> --- acc = 1.00 loss = 0.63
Cost: 0.65133, accuracy: 0.50000
eta = 0.1 ========> - acc = 0.50 loss = 0.65
Cost: 0.63824, accuracy: 0.50000
eta = 0.0 ==========> 293ms 14675us/step - acc = 0.60 loss = 0.65
Epoch 3/50
Cost: 0.44661, accuracy: 1.00000
eta = 0.3 > ---------- acc = 1.00 loss = 0.45
Cost: 0.78060, accuracy: 1.00000
eta = 0.3 => --------- acc = 1.00 loss = 0.78
Cost: 0.79208, accuracy: 1.00000
eta = 0.3 ==> -------- acc = 1.00 loss = 0.79
Cost: 0.49072, accuracy: 0.50000
eta = 0.2 ===> ------- acc = 0.50 loss = 0.49
Cost: 0.62232, accuracy: 1.00000
eta = 0.2 ====> ------ acc = 1.00 loss = 0.62
Cost: 0.82899, accuracy: 1.00000
eta = 0.2 =====> ----- acc = 1.00 loss = 0.83
Cost: 0.67629, accuracy: 0.50000
eta = 0.1 ======> ---- acc = 0.50 loss = 0.68
Cost: 0.62621, accuracy: 0.50000
eta = 0.1 =======> --- acc = 0.50 loss = 0.63
Cost: 0.46077, accuracy: 1.00000
eta = 0.1 ========> - acc = 1.00 loss = 0.46
Cost: 0.62076, accuracy: 1.00000
eta = 0.0 ==========> 304ms 15221us/step - acc = 0.85 loss = 0.63
```
Now let's use the model to control MK.js from the browser:

```ts
const video = document.getElementById('cam');
const Layer = 'global_average_pooling2d_1';
const mobilenetInfer = m => (p): tf.Tensor<tf.Rank> => m.infer(p, Layer);
const canvas = document.getElementById('canvas');
const scale = document.getElementById('crop');

const ImageSize = {
  Width: 100,
  Height: 56
};

navigator.mediaDevices
  .getUserMedia({ video: true, audio: false })
  .then(stream => {
    video.srcObject = stream;
  });
```

In this snippet:

- video contains a reference to the HTML5 video element on the page;
- Layer contains the name of the MobileNet layer from which we want to take the output and pass it as input to our model;
- mobilenetInfer is a function that accepts a MobileNet instance and returns another function; the returned function takes input data and returns the corresponding output from the specified MobileNet layer;
- canvas points to the HTML5 canvas element we will use to extract frames from the video;
- scale is another canvas, used to scale individual frames.

In the last lines we get the user's video stream and set it as the source of the video element.
and converts its contents: const grayscale = (canvas: HTMLCanvasElement) => { const imageData = canvas.getContext('2d').getImageData(0, 0, canvas.width, canvas.height); const data = imageData.data; for (let i = 0; i < data.length; i += 4) { const avg = (data[i] + data[i + 1] + data[i + 2]) / 3; data[i] = avg; data[i + 1] = avg; data[i + 2] = avg; } canvas.getContext('2d').putImageData(imageData, 0, 0); };
As a next step, we load the binary classification model and MobileNet:

```ts
let mobilenet: (p: any) => tf.Tensor<tf.Rank>;
tf.loadModel('http://localhost:5000/model.json').then(model => {
  mobileNet
    .load()
    .then((mn: any) => {
      // start the detection loop only after mobilenet has been assigned
      mobilenet = mobilenetInfer(mn);
      startInterval(mobilenet, model)();
    });
});
```

Once MobileNet has loaded, we pass it to mobilenetInfer to get a shortcut for computing the output of the hidden layer of the network. After that we call the startInterval method with the two networks as arguments:

```ts
const startInterval = (mobilenet, model) => () => {
  setInterval(() => {
    canvas.getContext('2d').drawImage(video, 0, 0);

    // downscale the frame onto the `scale` canvas, then grayscale it
    scale
      .getContext('2d')
      .drawImage(
        canvas, 0, 0, canvas.width,
        canvas.width / (ImageSize.Width / ImageSize.Height),
        0, 0, ImageSize.Width, ImageSize.Height
      );
    grayscale(scale);

    const [punching] = Array.from((
      model.predict(mobilenet(tf.fromPixels(scale))) as tf.Tensor1D)
      .dataSync() as Float32Array);

    const detect = (window as any).Detect;
    if (punching >= 0.4) detect && detect.onPunch();
  }, 100);
};
```
The most interesting things happen in startInterval! First, we start an interval in which an anonymous function is called every 100 ms. It first renders the current video frame onto the canvas. Then we downscale the frame to 100x56 and apply the grayscale filter to it.

Next we pass the scaled frame through MobileNet and hand the result to the predict method of our model, which returns a tensor with one element. Using dataSync we get the value out of the tensor and assign it to the constant punching.

Finally, if the probability of a punch is at least 0.4, we call the onPunch method of the global Detect object. MK.js provides a global object with three methods — onKick, onPunch, and onStand — that we can use to control one of the characters.
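MK.js's internals are not part of this article, so the following is only a sketch of the contract window.Detect has to satisfy — the method names come from the text above, while the stub bodies are placeholders you could use to test the classifier without the game running:

```ts
// Shape of the global object the detection loop expects. In MK.js the
// methods trigger moves of the controlled character; here they just log.
interface DetectApi {
  onPunch(): void;
  onKick(): void;
  onStand(): void;
}

(window as any).Detect = {
  onPunch: () => console.log('punch'),
  onKick: () => console.log('kick'),
  onStand: () => console.log('stand'),
} as DetectApi;
```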
Now let's improve the model so that it also recognizes kicks. We read the images from a third directory and encode the labels as rows of a two-dimensional tensor:

```ts
const punches = require('fs')
  .readdirSync(Punches)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Punches}/${f}`);

const kicks = require('fs')
  .readdirSync(Kicks)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Kicks}/${f}`);

const others = require('fs')
  .readdirSync(Others)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Others}/${f}`);

const ys = tf.tensor2d(
  new Array(punches.length)
    .fill([1, 0, 0])
    .concat(new Array(kicks.length).fill([0, 1, 0]))
    .concat(new Array(others.length).fill([0, 0, 1])),
  [punches.length + kicks.length + others.length, 3]
);

const xs: tf.Tensor2D = tf.stack(
  punches
    .map((path: string) => mobileNet(readInput(path)))
    .concat(kicks.map((path: string) => mobileNet(readInput(path))))
    .concat(others.map((path: string) => mobileNet(readInput(path))))
) as tf.Tensor2D;
```
In ys there will be n elements with the value [1, 0, 0], m elements with the value [0, 1, 0], and k elements with the value [0, 0, 1]. A vector of n elements in which n - 1 elements have the value 0 and one element has the value 1 is called a one-hot vector.

We form xs by stacking the MobileNet output for each image.

The model also has to change — the output layer now has three units:

```ts
const model = tf.sequential();
model.add(tf.layers.inputLayer({ inputShape: [1024] }));
model.add(tf.layers.dense({ units: 1024, activation: 'relu' }));
model.add(tf.layers.dense({ units: 3, activation: 'softmax' }));
await model.compile({
  optimizer: tf.train.adam(1e-6),
  loss: tf.losses.sigmoidCrossEntropy,
  metrics: ['accuracy'],
});
```
The output layer now uses softmax, which turns its parameters into a tensor of three values. Why three units for the output layer, when three classes could be encoded with two bits: 00, 01, 10? Because the values of a tensor produced by softmax always sum to 1, we would never get 00, and so we would be unable to classify the images of one of the classes.

After training the model for 500 epochs, I achieved an accuracy of about 92%! This is not bad, but do not forget that the training was done on a small data set.

The next step is to run the model in the browser. The logic is similar to the binary classification case:

```ts
const [punch, kick, nothing] = Array.from((model.predict(
  mobilenet(tf.fromPixels(scaled))
) as tf.Tensor1D).dataSync() as Float32Array);

const detect = (window as any).Detect;
if (nothing >= 0.4) return;

if (kick > punch && kick >= 0.35) {
  detect.onKick();
  return;
}
if (punch > kick && punch >= 0.35) detect.onPunch();
```
First we get a Float32Array from dataSync, then use Array.from to convert the typed array into a JavaScript array. After that, we extract the probabilities that the frame contains a punch, a kick, or nothing.

If the probability of "nothing" is at least 0.4, we return. Otherwise, if the probability of a kick is above 0.35 and higher than the probability of a punch, we send a kick command to MK.js. If the probability of a punch is above 0.35 and higher than the probability of a kick, we send a punch action.
frames from the video and transfer them to CNN. The CNN output for each frame is passed as input to the RNN. The recurrent neural network will determine the dependencies between the individual frames and recognize which action they correspond to.Source: https://habr.com/ru/post/428019/
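The article stops at the idea, so the following is only a sketch of how such a CNN + RNN combination could look in TensorFlow.js — the frame count, LSTM size, and learning rate are assumptions, and the per-frame features are the same 1024-value MobileNet vectors used throughout this article:

```ts
import * as tf from '@tensorflow/tfjs';

const Frames = 16;        // n — how many recent frames form one sequence (assumed)
const FeatureSize = 1024; // size of MobileNet's global_average_pooling2d_1 output

// Each input sample is a sequence of per-frame MobileNet feature vectors;
// the LSTM aggregates them, and softmax picks punch / kick / nothing.
const actionModel = tf.sequential();
actionModel.add(tf.layers.inputLayer({ inputShape: [Frames, FeatureSize] }));
actionModel.add(tf.layers.lstm({ units: 64 }));
actionModel.add(tf.layers.dense({ units: 3, activation: 'softmax' }));
actionModel.compile({
  optimizer: tf.train.adam(1e-4),
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy'],
});
```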
Source: https://habr.com/ru/post/428019/