AI-Powered Computer Vision Applications in the Media Industry
Yulia Pavlova
www.linkedin.com/in/yuliapavlovaphd/
Datatalks
7 December 2021
Computer vision applications in the media industry:
Face Recognition, Image Understanding, Video Analysis
WHY IS IT IMPORTANT?
Metadata (descriptive metadata) is descriptive information about a resource, used for
discovery and identification. It includes elements such as title, abstract, author, and keywords.
Discoverability: both the internal workflow (editors) and external customers benefit from an
enhanced search experience through a larger scope of metadata.
Intelligence augmentation: internal stakeholders (for instance, editors and news distributors)
benefit from intelligent augmentation that proposes what metadata to include.
2
Use case: English football Premier League (EPL)
Pictures editorial workflow load: ~2,000-20,000 photos per day
1. Images arrive at the editors
2. Careful, iterative, and predominantly manual process: selection, editing, and annotation
3. Images are published
4. Images become accessible to our customers
Objective:
automate without loss of quality and gain speed
3
Use case 1.
FACIAL RECOGNITION
4
Goals
• Automate identification of people in photos to assist editors and reduce misidentification in captions
• Smoothly integrate into the existing workflow
• Foundation for enhancements: auto-captioning, people-based search
Face Recognition pipeline with AWS co-development
• AWS customer investment program focused on R&D and innovation: co-development together with the AWS R&D team
• AWS Step Functions based serverless machine learning pipeline
6
High-level process description
Input Bucket → Output Bucket
Custom 1:
[
  {"Name": "Roberto Firmino", "Description": "", "Confidence": 100,
   "Source": "CelebAPI",
   "Bounding": {"Width": 0.1368750035762787, "Height": 0.2433333396911621,
                "Left": 0.1274999976158142, "Top": 0.10222221910953522}},
  {"Name": "Mohamed Salah", "Description": "", "Confidence": 100,
   "Source": "CelebAPI",
   "Bounding": {"Width": 0.13249999284744263, "Height": 0.23555555939674377,
                "Left": 0.4581249952316284, "Top": 0.08666666597127914}},
  {"Name": "Sadio Mane", "Description": "EPL player on Liverpool",
   "Confidence": 99.97816467285156, "Source": "CustomFaces",
   "Bounding": {"Width": 0.11625000089406967, "Height": 0.20666666328907013,
                "Left": 0.7193750143051147, "Top": 0.08888889104127884}}
]
0.5 MB file size (typical of the samples)
• 3 faces detected
• 3 identified
• Processing time dominated by file transfer
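As a rough illustration, per-face records like the JSON above could be assembled from two Rekognition calls: celebrity recognition for well-known players, plus a search against a custom face collection for the rest. A minimal boto3 sketch, assuming a hypothetical collection named "epl-players" that was indexed with player headshots (the collection name and region are assumptions, not from the talk):

```python
# Minimal sketch: build per-face metadata like the JSON output above with boto3.
# Assumes a Rekognition face collection ("epl-players", a hypothetical name) was
# populated via index_faces, storing player names as ExternalImageId.
import boto3

rekognition = boto3.client("rekognition", region_name="eu-west-1")

def identify_faces(image_bytes: bytes, collection_id: str = "epl-players") -> list[dict]:
    results = []

    # 1) Celebrity recognition covers well-known players out of the box.
    celebs = rekognition.recognize_celebrities(Image={"Bytes": image_bytes})
    for celeb in celebs["CelebrityFaces"]:
        results.append({
            "Name": celeb["Name"],
            "Description": "",
            "Confidence": celeb["MatchConfidence"],
            "Source": "CelebAPI",
            "Bounding": celeb["Face"]["BoundingBox"],
        })

    # 2) A custom face collection catches players the celebrity API misses.
    matches = rekognition.search_faces_by_image(
        CollectionId=collection_id,
        Image={"Bytes": image_bytes},
        FaceMatchThreshold=95,
    )
    for match in matches["FaceMatches"]:
        results.append({
            "Name": match["Face"]["ExternalImageId"],  # name stored at indexing time
            "Description": "EPL player",
            "Confidence": match["Similarity"],
            "Source": "CustomFaces",
            "Bounding": match["Face"]["BoundingBox"],
        })
    return results
```

Note that search_faces_by_image matches only the largest face in the image, so a production pipeline would first call detect_faces and search each cropped face separately.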
Result: visually annotated image
Recognition results
Face Recognition tool finds more
people than mentioned in the caption
“Wolverhampton
Wanderers' Raul Jimenez celebrates
scoring their second goal with
teammates”
Over 99.9% confidence on
detected faces
9
False Positive: Object
“A fan with a matchday programme
before the match”
Rekognition Result:
• Shane Long, 0.999
False Positive: Twins
“Cardiff City's Josh Murphy during
the warmup”
Rekognition Result:
Jacob Murphy, 0.99972
• Josh Murphy's twin brother
• plays for Newcastle United
Visible Face & Caption Mismatch: Subject not visible
Caption:
“Crystal Palace's Andros Townsend
celebrates after the match with teammates”
Rekognition Result:
Christian Benteke, 0.999
Performance Metric
• Testing data = 1504 images:
• Images with 1-to-1 matches (1 face; 1
name)
• Faces identified from custom face
collection (629 EPL headshots) or AWS API
• Accuracy = 99%
• Precision = 99%
• Recall = 100%
Confusion matrix (Predicted = Rekognition result, Actual = person in photo):

                  Actual True     Actual False
Predicted True    1103 (73.4%)    12 (0.8%)
Predicted False   0 (0%)          389 (25.8%)
13
Performance Metric
• Image resolution:
  • THUMBNAIL (<22 KB): resolution was not sufficient for face detection;
    all THUMBNAILs completed in about 1 second.
  • VIEWIMAGE (91-154 KB) and BASEIMAGE (1.6-2.4 MB): both resolutions identify the faces.
    • BASEIMAGE times varied between 5-9 seconds
    • VIEWIMAGE times varied between 2-3 seconds
• Recommendation: use BASEIMAGE resolution for face identification.
Step   Description                                 Timing
1      New image uploaded (external dependency)    ?
2      Pipeline starts processing                  1 sec
3      Rekognition finds and identifies faces      4 sec
4a     Metadata added to image                     4 sec
4b     Image is visually annotated                 6 sec
       Total processing                            11 sec
14
Use case 2.
IMAGE CAPTIONING
15
“IMAGE CAPTIONING” OBJECTIVE
Original caption:
Everton's Wayne Rooney applauds the fans after
the match
Annotate football images, combined with face recognition, following the editorial style
16
Sequence-to-sequence (Seq2Seq) models are deep learning models that
have had some success with image captioning.
The model is trained with EPL photos & captions; heavy preprocessing is required.
• Training data: <scrubbed caption, photo> pairs
• Model input: "Everton's Wayne Rooney applauds the fans after the match" (original
  caption), scrubbed to "team's player applauds the fans after the match"
• Model output: templated captions such as
  "[team]'s [player] in action with [team]'s [player]"
Name Extraction Methods: Accuracy
amz = Amazon Comprehend
core = Stanford CoreNLP
dm = direct matching using a list of player names
dm_prim = direct matching against the custom collection of players
dm_all = master list of names plus remapping of spelling errors
Method Accuracy
dm_prim 0.3412
core_dm_prim 0.8081
core 0.8162
core_amz 0.8596
core_amz_dm_all 0.8609
amz_dm_prim 0.8940
core_dm_all 0.9112
amz 0.9243
amz_dm_all 0.9268
dm_all 0.9938
*highlighted methods do not require manual listing of names
E.g. “Manchester United's Luke Shaw in action with Manchester City's Bernardo Silva”
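To make two of these methods concrete, here is a minimal sketch of dm_all (direct matching against a master list with spelling remaps) and amz (PERSON entities from Amazon Comprehend). The master_names mapping below is a hypothetical illustration, not the talk's actual list; the Comprehend call is the real boto3 API:

```python
# Sketch of two name-extraction methods from the table above.
import boto3

comprehend = boto3.client("comprehend", region_name="eu-west-1")

# dm_all: direct matching against a master list, including spelling-error remaps.
# (This tiny alias->canonical mapping is illustrative only.)
master_names = {
    "Sadio Mane": "Sadio Mane",
    "Mane": "Sadio Mane",
    "Firmino": "Roberto Firmino",
}

def extract_names_dm_all(caption: str) -> set[str]:
    return {canonical for alias, canonical in master_names.items() if alias in caption}

# amz: named-entity recognition with Amazon Comprehend (PERSON entities).
def extract_names_amz(caption: str) -> set[str]:
    response = comprehend.detect_entities(Text=caption, LanguageCode="en")
    return {e["Text"] for e in response["Entities"] if e["Type"] == "PERSON"}
```

The table suggests why dm_all wins (0.9938): exact matching against a curated list avoids NER misses, but unlike amz it requires manually maintaining the list of names.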
Framework and Data
Code
• written in the PyTorch framework (torch and torchvision)
Dataset
• ~100K images with captions (split into train, valid, and test: 80%, 10%, 10%),
  specifically on the English football Premier League
Model
• sequence-to-sequence model with attention
19
Model Architecture
• A sequence-to-sequence model comprises two main parts:
  • Encoder: pre-trained CNN
  • (additional) Attention network: selectively focuses on different parts of the image
  • Decoder: trainable RNN model
Example output: "Player applauds the fans after the match"
20
Encoder
Pre-trained ResNet-152:
torchvision: models.resnet152(pretrained=True)
• receives a 224x224 randomly cropped image sample with transforms applied
  (random crop, normalization, etc.)
• no need to train the encoder
• output: feature maps of shape (batch, pixels, num_feature_maps)
21
Encoder
Encoder ResNet-152:
• ignore the adaptive average pooling and linear layers at the end
• take the output of the last convolutional block: modules = list(resnet.children())[:-2]
• extracts 2048 feature maps with a size of 7x7 each
• to prepare the features for decoding, permute the dimensions and reshape: output size of (batch, 49, 2048)
• batch size = 64
22
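A minimal sketch of the encoder described on these two slides: ResNet-152 with the average-pooling and linear layers stripped, and the 7x7x2048 output reshaped to (batch, 49, 2048). The Resize(256) step before the random crop is an assumption added so the crop is always valid:

```python
# Encoder sketch: frozen ResNet-152 backbone producing (batch, 49, 2048) features.
import torch
import torch.nn as nn
from torchvision import models, transforms

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(pretrained=True)
        # Drop the adaptive average pooling and the final linear layer.
        modules = list(resnet.children())[:-2]
        self.backbone = nn.Sequential(*modules)
        for p in self.backbone.parameters():
            p.requires_grad = False  # the encoder is not trained

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, 224, 224) -> features: (batch, 2048, 7, 7)
        features = self.backbone(images)
        # Flatten the 7x7 grid into 49 "pixels": (batch, 49, 2048)
        return features.permute(0, 2, 3, 1).reshape(features.size(0), -1, 2048)

# Input pipeline as on the previous slide: random crop + normalization.
preprocess = transforms.Compose([
    transforms.Resize(256),        # assumption: upsize so RandomCrop(224) is valid
    transforms.RandomCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```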
Attention mechanism
Soft Bahdanau (additive) attention learns attention scores using a feed-forward network during training:
• "emphasis on the most important pixels in the image"
• to focus on relevant parts at each decoding step, the attention network outputs
  a context vector (1, 2048), which is the weighted sum of the encoder's output features
Reference: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"
23
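A minimal sketch of soft additive attention over the 49 encoder "pixels". The 2048 feature dimension follows the slides; the hidden and attention sizes are assumptions:

```python
# Additive (Bahdanau-style) attention: a feed-forward network scores each of the
# 49 encoder pixels against the decoder state, then returns their weighted sum.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, feature_dim: int = 2048, hidden_dim: int = 512, attn_dim: int = 512):
        super().__init__()
        self.enc_proj = nn.Linear(feature_dim, attn_dim)  # project encoder features
        self.dec_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)               # feed-forward scoring layer

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, 49, 2048); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(
            self.enc_proj(features) + self.dec_proj(hidden).unsqueeze(1)
        ))                                                # (batch, 49, 1)
        weights = torch.softmax(scores, dim=1)            # attention over the 49 pixels
        context = (weights * features).sum(dim=1)         # (batch, 2048) weighted sum
        return context, weights.squeeze(-1)
```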
Decoder
Task: given the input feature maps X and a target caption Y of length T, the model
learns to accurately predict the sequence Y by computing the log probability log P(Y|X).
1. Create embeddings from the target caption
2. Initialize the hidden states (h, c)
3. Concatenate the embeddings and the context vector into a single input to the LSTM cell
4. Apply dropout regularization to the hidden state h
Given a hidden state h and a previous token y, the model learns to generate the next token in the sequence.
24
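One decoder step as enumerated above, as a sketch: embed the previous token, concatenate it with the attention context, feed an LSTM cell, and apply dropout to the hidden state. Vocabulary and embedding sizes are assumptions, not the talk's exact values:

```python
# One step of the LSTM decoder with attention context.
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 512,
                 feature_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # step 1
        self.lstm = nn.LSTMCell(embed_dim + feature_dim, hidden_dim)   # step 3
        self.dropout = nn.Dropout(0.5)                                 # step 4
        self.fc = nn.Linear(hidden_dim, vocab_size)                    # next-token logits

    def forward(self, prev_token, context, h, c):
        # prev_token: (batch,); context: (batch, 2048); h, c: (batch, hidden_dim)
        x = torch.cat([self.embed(prev_token), context], dim=1)
        h, c = self.lstm(x, (h, c))
        logits = self.fc(self.dropout(h))  # P(y_t | y_<t, X) after (log-)softmax
        return logits, h, c
```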
Decoder
Generation proceeds token by token, e.g. for the target "Team player in action with a team player":
• input so far: team player in action with a team
• predicted next token: player → team player in action with a team player
25
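A greedy generation loop matching this token-by-token example, as a sketch; attention and decoder_step refer to the modules sketched on the previous slides, and the start/end token ids are assumed:

```python
# Greedy decoding: at each step, attend over the image features and pick the
# most likely next token until the end token (or a length limit) is reached.
import torch

def generate(features, attention, decoder_step, h, c, start_id, end_id, max_len=20):
    tokens = [start_id]
    for _ in range(max_len):
        context, _ = attention(features, h)       # focus on the relevant pixels
        prev = torch.tensor([tokens[-1]])
        logits, h, c = decoder_step(prev, context, h, c)
        next_id = int(logits.argmax(dim=1))       # greedy: most likely token
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # e.g. "team player in action with a team" -> + "player"
```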
Train (80K), validate (10K) and test (10K)
Train with cross-entropy loss for n epochs:
• the loss function applies softmax to the outputs and takes the logarithm afterwards
• Adam optimizer with learning rate 1e-4
Validation metrics (BLEU score on n-grams):
• avg. validation loss: 0.8426
• avg. validation perplexity: 2.4008
• BLEU-1: 0.73
• BLEU-2: 0.60
• BLEU-3: 0.26
• BLEU-4: 0.23
26
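A minimal training-loop sketch under the settings above: nn.CrossEntropyLoss (softmax + log in one op) and Adam with lr = 1e-4. Here encoder, decoder, train_loader, and num_epochs are placeholders; decoder stands for a full-sequence wrapper around the step module sketched earlier:

```python
# Training sketch: cross-entropy over all caption positions, Adam at 1e-4.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)  # assumption: 0 = padding token
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

for epoch in range(num_epochs):
    for images, captions in train_loader:        # captions: (batch, T) token ids
        features = encoder(images)               # (batch, 49, 2048), frozen encoder
        logits = decoder(features, captions)     # (batch, T, vocab_size)
        # CrossEntropyLoss expects (N, C): flatten batch and time dimensions.
        # (The usual one-token target shift is omitted here for brevity.)
        loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```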
Test metric (ROUGE)
precision_Rouge_1*   0.787
recall_Rouge_1       0.708
F1_Rouge             0.718
* Precision is an important metric: it indicates usage of "original" editorial words.
27
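To make these numbers concrete, ROUGE-1 is just unigram overlap between the reference and candidate captions; a self-contained sketch with simple whitespace tokenization (the real evaluation may tokenize differently):

```python
# ROUGE-1 precision/recall/F1 computed by hand for one caption pair.
from collections import Counter

def rouge_1(reference: str, candidate: str):
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())            # matched unigrams
    precision = overlap / max(sum(cand.values()), 1)  # fraction of candidate words
    recall = overlap / max(sum(ref.values()), 1)      # fraction of reference words
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

print(rouge_1("Everton's Wayne Rooney applauds the fans after the match",
              "team player applauds fans after the match"))
```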
Combining Facial Recognition
& Auto-Captioning
28
Original caption: Everton's Wayne Rooney applauds the
fans after the match
________________________
Face Recognition
• 'Name': 'Wayne Rooney', 'Description’: NA (Celeb API)
Predicted caption (seq2seq):
• team player applauds fans after the match
_________________________
Combined Caption
Wayne Rooney applauds fans after the match
29
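A sketch of the combination step illustrated above: fill the generic "team player" slot in the seq2seq caption with the team-qualified names returned by face recognition. The merging rule here (replace the first slot with a comma-joined name list) is an assumption about the talk's approach, shown on the Wayne Rooney example:

```python
# Combine face-recognition names with the templated seq2seq caption.
import re

def combine(predicted: str, faces: list[dict]) -> str:
    # Assumes at least one recognized face; Description follows the
    # "EPL player on <team>" convention seen in the pipeline output.
    names = [f["Name"] for f in faces]
    team = next((f["Description"].removeprefix("EPL player on ").strip()
                 for f in faces if f.get("Description")), "")
    joined = ", ".join(names[:-1]) + " and " + names[-1] if len(names) > 1 else names[0]
    subject = f"{team}'s {joined}" if team else joined
    # Replace the first templated subject ("team player" / "team's player").
    return re.sub(r"team(?:'s)? player", subject, predicted, count=1)

faces = [{"Name": "Wayne Rooney", "Description": ""}]
print(combine("team player applauds fans after the match", faces))
# -> "Wayne Rooney applauds fans after the match"
```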
Original caption:
• Leicester City's Jamie Vardy during the warm up before the match
___________________________________
Facial Recognition found 4 persons:
• 'Name': 'Jamie Vardy', 'Description': 'EPL player on Leicester City'
• 'Name': 'Wilfred Ndidi', 'Description': 'EPL player on Leicester City'
• 'Name': 'Ben Chilwell', 'Description': 'EPL player on Leicester City'
• 'Name': 'Dennis Praet', 'Description': 'EPL player on Leicester City'
Predicted caption
• team player and player during the warm up before the match
______________________________
Combined predicted caption:
Leicester City's Jamie Vardy, Wilfred Ndidi, Ben Chilwell and
Dennis Praet during the warm up before the match
30
Original caption:
• Chelsea's Mateo Kovacic and Antonio Rudiger arrive at
the stadium before the match
____________________________________
Face recognition recognized 2 persons:
• Chelsea’s Mateo Kovacic
• Chelsea’s Antonio Rudiger
Predicted caption: team player arrives at the stadium
before the match
_________________________________________
Combined caption:
• Chelsea’s Mateo Kovacic and Antonio Rudiger arrive at
the stadium before the match
31
Challenges in Combining Facial Recognition & Auto-Captioning
Actual Caption:
Brighton & Hove Albion's Leandro Trossard
scores their first goal
Seq2Seq Predicted Caption:
[team]’s [player] scores their [number] goal
Challenges:
• Multiple players identified, but only one
player slot in proposed caption
• Scoring player not identified
Conclusion
• Examples of descriptive information about
images
• Use case of facial recognition and image
captioning applied to English football Premier
League
• AWS framework for custom-trained facial
recognition and pipeline testing
• Custom-trained sequence-to-sequence
model with attention
• Results of combining both models
33
Useful information
• Learn about AWS Rekognition: https://aws.amazon.com/rekognition/
• A great tutorial on seq2seq models for captioning by Artem Makarov:
  https://medium.com/analytics-vidhya/image-captioning-with-attention-part-1-e8a5f783f6d3
• "Show, Attend and Tell" paper about the attention mechanism:
  https://arxiv.org/pdf/1502.03044.pdf
• ROUGE metric: https://en.wikipedia.org/wiki/ROUGE_(metric)
34
Q & A
35