Computer vision techniques like facial recognition and image captioning can help automate metadata generation for media companies. Facial recognition can identify people in photos to assist editors and improve searchability, while image captioning can propose captions. A case study of applying these techniques to photos from English Premier League football games achieved 99% accuracy for facial recognition and precision of 78.7% for image captioning. Combining the two allows generating customized captions that include names identified through facial recognition. Challenges remain when the automatic caption does not match details in the image.
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
1. AI-Powered Computer Vision Applications in Media Industry
Yulia Pavlova
www.linkedin.com/in/yuliapavlovaphd/
Datatalks
7 December 2021
2. Computer Vision application in media industry
Face Recognition Image understanding Video Analysis
WHY IS IT IMPORTANT?
Metadata (Descriptive metadata) is descriptive information about a resource. It is used for
discovery and identification. It includes elements such as title, abstract, author, and keyword, etc.
Discoverability: Both internal workflows (editors) and external customers benefit from an
enhanced search experience through a larger scope of metadata.
Intelligence augmentation: Internal stakeholders (for instance, editors and news distributors)
benefit from intelligent augmentation that proposes what metadata to include.
3. Use case: English football Premier League (EPL)
Pictures editorial workflow load: ~2,000-20,000 photos per day
1. Images arrive with editors
2. Careful, iterative and predominantly manual process: selection, editing and annotating
3. Images are published
4. Images become accessible to our customers
Objective: automate without loss of quality and gain more speed
5. Goals
• Automate identification of people in photos to assist editors and reduce misidentification in captions
• Smoothly integrate into existing workflow
• Foundation for enhancements:
• Auto-captioning
• People-based search
6. Face Recognition pipeline with AWS co-development
• AWS customer investment program focused on R&D and Innovation: co-development together with the AWS R&D team
• AWS Step Functions-based serverless machine learning pipeline
7. High-level process description
Input Bucket
Output Bucket
Custom 1:
[
  {"Name": "Roberto Firmino", "Description": "", "Confidence": 100, "Source": "CelebAPI",
   "Bounding": {"Width": 0.1368750035762787, "Height": 0.2433333396911621, "Left": 0.1274999976158142, "Top": 0.10222221910953522}},
  {"Name": "Mohamed Salah", "Description": "", "Confidence": 100, "Source": "CelebAPI",
   "Bounding": {"Width": 0.13249999284744263, "Height": 0.23555555939674377, "Left": 0.4581249952316284, "Top": 0.08666666597127914}},
  {"Name": "Sadio Mane", "Description": "EPL player on Liverpool", "Confidence": 99.97816467285156, "Source": "CustomFaces",
   "Bounding": {"Width": 0.11625000089406967, "Height": 0.20666666328907013, "Left": 0.7193750143051147, "Top": 0.08888889104127884}}
]
0.5MB file size (typical of samples)
• 3 faces detected
• 3 identified
• Mostly file transfer time
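The output record above is plain JSON, so downstream steps can read it with standard tooling. The following is a minimal sketch, assuming the "Bounding" values are fractions of the image width and height; the helper names and the 1600x900 image size used in the usage note are illustrative assumptions, not part of the pipeline described in the slides.

```python
import json

# Shortened copy of the output record shown above (one face kept for brevity).
record = '''[{"Name": "Sadio Mane", "Description": "EPL player on Liverpool",
  "Confidence": 99.97816467285156, "Source": "CustomFaces",
  "Bounding": {"Width": 0.11625000089406967, "Height": 0.20666666328907013,
               "Left": 0.7193750143051147, "Top": 0.08888889104127884}}]'''


def to_pixel_box(bounding, image_width, image_height):
    """Convert a relative box to integer pixels: (left, top, width, height).

    Assumption: the values are ratios of the overall image size.
    """
    return (
        round(bounding["Left"] * image_width),
        round(bounding["Top"] * image_height),
        round(bounding["Width"] * image_width),
        round(bounding["Height"] * image_height),
    )


def confident_names(faces, min_confidence=99.0):
    """Keep only names whose confidence reaches the (hypothetical) threshold."""
    return [face["Name"] for face in faces if face["Confidence"] >= min_confidence]


faces = json.loads(record)
```

For example, `to_pixel_box(faces[0]["Bounding"], 1600, 900)` gives the pixel rectangle an annotation step could draw, and `confident_names(faces)` lists the names that would be proposed to an editor.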
9. Recognition results
Face Recognition tool finds more people than mentioned in the caption
“Wolverhampton Wanderers' Raul Jimenez celebrates scoring their second goal with teammates”
Over 99.9% confidence on detected faces
10. False Positive: Object
“A fan with a matchday programme
before the match”
Rekognition Result:
• Shane Long, 0.999
11. False Positive: Twins
“Cardiff City's Josh Murphy during
the warmup”
Rekognition Result:
Jacob Murphy, 0.99972
• twin brother
• plays for Newcastle United
12. Visible Face & Caption Mismatch: Subject not visible
Caption:
“Crystal Palace's Andros Townsend
celebrates after the match with teammates”
Rekognition Result:
Christian Benteke, 0.999
13. Performance Metric
• Testing data = 1504 images:
• Images with 1-to-1 matches (1 face; 1 name)
• Faces identified from custom face collection (629 EPL headshots) or AWS API
• Accuracy = 99%
• Precision = 99%
• Recall = 100%
Confusion matrix (rows: Rekognition result; columns: actual person in photo)
                   Actual True      Actual False
Predicted True     1103 (73.4%)     12 (0.8%)
Predicted False    0 (0%)           389 (25.8%)
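The reported accuracy, precision and recall can be re-derived from the four confusion-matrix counts. A minimal sketch using the standard definitions; the function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all images classified correctly
        "precision": tp / (tp + fp),     # share of predicted matches that were right
        "recall": tp / (tp + fn),        # share of actual matches that were found
    }


# Counts taken from the confusion matrix above (1504 test images in total).
metrics = classification_metrics(tp=1103, fp=12, fn=0, tn=389)
```

With these counts, accuracy is 1492/1504 (about 99.2%), precision is 1103/1115 (about 98.9%) and recall is 1103/1103 (100%), which matches the rounded figures on the slide.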
14. Performance Metric
• Image resolution
• THUMBNAIL (<22Kb)
• THUMBNAIL resolution was not sufficient for face detection.
• All THUMBNAILs completed in about 1 second.
• VIEWIMAGE (91-154Kb) and BASEIMAGE (1.6-2.4MB) sizes.
• Both BASEIMAGE and VIEWIMAGE resolutions allow the face to be identified
• BASEIMAGE times varied between 5-9 seconds
• VIEWIMAGE times varied between 2-3 seconds
• Recommendation: use BASEIMAGE resolution for face identification.
Step   Description                                  Timing
1      New image uploaded – external dependency     ?
2      Pipeline starts processing                   1 sec
3      Recognition finds and identifies faces       4 sec
4a     Metadata added to image                      4 sec
4b     Image is visually annotated                  6 sec
Total  Processing                                   11 sec
16. “IMAGE CAPTIONING” OBJECTIVE
Original caption:
Everton's Wayne Rooney applauds the fans after
the match
Annotate football images automatically, combining image captioning with face recognition and following
editorial style
17. Sequence-to-sequence (Seq2Seq) models are deep learning models that
have had some success with image captioning.
Model is trained with EPL photos & captions. Heavy preprocessing required.
Training Data: <scrubbed caption, photo>
Original caption: “Everton’s Wayne Rooney applauds the fans after the match”
Scrubbed caption: “team’s player applauds the fans after the match”
Model Input: photo
Model Output: [team]’s [player] in action with [team]’s [player]
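The caption scrubbing mentioned on this slide can be sketched as a simple name replacement. This is only an illustration under stated assumptions: the lookup lists below are hypothetical, and the slides do not say how the actual preprocessing was implemented.

```python
# Hypothetical lookup lists; a real pipeline would need complete rosters.
TEAMS = ["Everton", "Manchester United", "Manchester City"]
PLAYERS = ["Wayne Rooney", "Luke Shaw", "Bernardo Silva"]


def scrub_caption(caption, teams=TEAMS, players=PLAYERS):
    """Replace known team and player names with the generic tokens 'team' and 'player'."""
    # Longest names first, so a short name never replaces part of a longer one.
    for name in sorted(teams, key=len, reverse=True):
        caption = caption.replace(name, "team")
    for name in sorted(players, key=len, reverse=True):
        caption = caption.replace(name, "player")
    return caption


scrubbed = scrub_caption("Everton's Wayne Rooney applauds the fans after the match")
```

Training on scrubbed captions lets the model learn the sentence pattern without having to memorise individual names, which are filled in later from face recognition.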
18. Name Extraction Methods: Accuracy
amz = Amazon Comprehend
core = CoreNLP from Stanford
dm = direct matching using a list of player names
dm_prim = custom collection players
dm_all = master list of names plus remapping of spelling errors
Method Accuracy
dm_prim 0.3412
core_dm_prim 0.8081
core 0.8162
core_amz 0.8596
core_amz_dm_all 0.8609
amz_dm_prim 0.8940
core_dm_all 0.9112
amz 0.9243
amz_dm_all 0.9268
dm_all 0.9938
*highlighted methods do not require manual listing of names
E.g. “Manchester United's Luke Shaw in action with Manchester City's Bernardo Silva”
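As an illustration of the direct-matching idea behind dm_all (a master list of names plus remapping of spelling errors), here is a minimal sketch. The name list, the remap table and the exact-match definition of accuracy are all assumptions made for the example; the slides do not specify how accuracy was computed.

```python
# Hypothetical master list and spelling remap, for illustration only.
MASTER_NAMES = ["Luke Shaw", "Bernardo Silva", "Sadio Mane"]
SPELLING_REMAP = {"Sadio Mané": "Sadio Mane"}


def extract_names(caption, names=MASTER_NAMES, remap=SPELLING_REMAP):
    """Return the listed names found in the caption, in order of appearance."""
    for variant, canonical in remap.items():
        caption = caption.replace(variant, canonical)
    found = [(caption.find(name), name) for name in names if name in caption]
    return [name for _, name in sorted(found)]


def extraction_accuracy(examples, extractor=extract_names):
    """Share of captions whose extracted names exactly equal the expected names (assumed metric)."""
    correct = sum(1 for caption, expected in examples if extractor(caption) == expected)
    return correct / len(examples)


names = extract_names(
    "Manchester United's Luke Shaw in action with Manchester City's Bernardo Silva"
)
```

A list-based matcher like this can only be as good as its list: a caption naming a player missing from the list is counted as a miss, which is one plausible reason the smaller dm_prim list scores so much lower than dm_all.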
19. Framework and Data
Code
• written in the PyTorch framework (torchvision and torch)
Dataset
• ~100K images with captions, all from the English football Premier League
(split into train, valid and test: 80%, 10%, 10%)
Model
• Sequence-to-sequence model with attention
20. Model Architecture
• Sequence-to-sequence model comprises two main parts plus an attention network:
• Encoder (pre-trained CNN);
• (additional) Attention network (selectively focuses on different parts of the image)
• Decoder: trainable RNN model.
Example caption: Player applauds the fans after the match
22. Encoder
Encoder ResNet-152:
• ignoring the final adaptive average pooling and linear layers
• output taken from the last convolutional block: models = list(resnet.children())[:-2]
• Extracts 2048 feature maps with a size of 7x7 each
• To prepare the features for decoding, permute the dimensions and reshape: output size of (batch, 49, 2048)
• Batch size = 64
23. Attention mechanism
Soft Bahdanau (additive) attention that learns attention scores using a feed-forward network during training
• “emphasis on the most important pixels in the image”
• To focus on the relevant parts at each decoding step, the attention network outputs
the context vector (1, 2048), which is the weighted sum of the Encoder’s output (features)
“Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”
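A minimal sketch of additive (Bahdanau-style) soft attention over the 49 encoder locations is shown below. The layer names and the decoder and attention dimensions (512 and 256) are assumptions for illustration; only the 2048-dimensional context vector comes from the slide.

```python
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Scores each image location with a small feed-forward network."""

    def __init__(self, encoder_dim=2048, decoder_dim=512, attention_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(encoder_dim, attention_dim)  # projects image features
        self.W_dec = nn.Linear(decoder_dim, attention_dim)  # projects decoder hidden state
        self.v = nn.Linear(attention_dim, 1)                # reduces to one score per location

    def forward(self, features, hidden):
        # features: (batch, 49, encoder_dim); hidden: (batch, decoder_dim)
        combined = torch.tanh(self.W_enc(features) + self.W_dec(hidden).unsqueeze(1))
        scores = self.v(combined).squeeze(2)                   # (batch, 49)
        alpha = torch.softmax(scores, dim=1)                   # weights sum to 1 per image
        context = (features * alpha.unsqueeze(2)).sum(dim=1)   # weighted sum: (batch, encoder_dim)
        return context, alpha


attention = AdditiveAttention()
context, alpha = attention(torch.randn(3, 49, 2048), torch.randn(3, 512))
```

Because the weights come from a softmax, every location contributes a little ("soft" attention), and the whole module stays differentiable and trainable with the rest of the model.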
24. Decoder
Task: Given the input feature maps X and target captions Y of length T,
the model learns to predict the sequence Y by maximizing the log
probability log P(Y|X)
1. Create embeddings from the target caption -> 2. Initialize the hidden states (h, c) -> 3. Concatenate the
embeddings and context vector into a single input to the LSTM cell -> 4. Apply dropout regularization to the
hidden state h
Given a hidden state h and the previous token y, the model learns to generate the next token in the sequence
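The four decoder steps can be sketched as a single decoding step in PyTorch. Names, dimensions and the dropout rate are assumptions; initialising (h, c) from the mean of the encoder features follows the "Show, Attend and Tell" paper and may differ from the presenter's code. For brevity, the mean feature vector also stands in for the attention context in the usage lines.

```python
import torch
import torch.nn as nn


class DecoderStep(nn.Module):
    """One LSTM decoding step: previous token + context vector -> next-token logits."""

    def __init__(self, vocab_size, embed_dim=256, encoder_dim=2048, decoder_dim=512, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)             # step 1
        self.init_h = nn.Linear(encoder_dim, decoder_dim)                # step 2
        self.init_c = nn.Linear(encoder_dim, decoder_dim)
        self.lstm_cell = nn.LSTMCell(embed_dim + encoder_dim, decoder_dim)
        self.dropout = nn.Dropout(dropout)                               # step 4
        self.fc = nn.Linear(decoder_dim, vocab_size)

    def init_hidden(self, features):
        # Assumption: start (h, c) from the mean image feature.
        mean_features = features.mean(dim=1)
        return self.init_h(mean_features), self.init_c(mean_features)

    def forward(self, prev_token, context, state):
        embedded = self.embedding(prev_token)                 # (batch, embed_dim)
        lstm_input = torch.cat([embedded, context], dim=1)    # step 3: concatenate
        h, c = self.lstm_cell(lstm_input, state)
        logits = self.fc(self.dropout(h))                     # scores over the vocabulary
        return logits, (h, c)


decoder = DecoderStep(vocab_size=50).eval()
features = torch.randn(2, 49, 2048)
state = decoder.init_hidden(features)
logits, (h, c) = decoder(torch.tensor([1, 2]), features.mean(dim=1), state)
```

During training this step would be repeated T times, feeding the ground-truth previous token each time; at inference the token with the highest score (or a beam of candidates) is fed back in.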
25. Decoder
[Figure: decoding example for the caption “Team player in action with a team player”, generated one token per step]
26. Train (80K), validate (10K) and test (10K)
Validation metric, BLEU-score on n-grams
• Avg. Loss valid: 0.8426
• Avg. Perplexity valid: 2.4008
• BLEU-1: 0.73
• BLEU-2: 0.60
• BLEU-3: 0.26
• BLEU-4: 0.23
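To make the n-gram idea behind the BLEU scores concrete, here is a small sketch of clipped (modified) n-gram precision for one candidate and one reference. It is not the full BLEU score: the brevity penalty and the geometric mean over n are omitted, and the slides do not say which BLEU implementation was used.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-token tuples in the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count by how often it appears in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


candidate = "team player applauds fans after the match"
reference = "team player applauds the fans after the match"
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
```

In this example every candidate word appears in the reference, but the bigram "applauds fans" does not, so the bigram precision is lower than the unigram precision. The same effect explains why the scores on the slide fall from BLEU-1 to BLEU-4.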
Train with Cross-Entropy loss for n epochs:
Loss function applies softmax to outputs and performs logarithmic operation afterwards
Adam Optimizer with learning rate 10^-4
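The loss described above (softmax followed by a logarithm) and its relation to perplexity can be sketched in plain Python. The function names are illustrative; in PyTorch the same computation is provided by the built-in cross-entropy loss.

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed stably."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    log_sum = math.log(sum(math.exp(x - shift) for x in logits))
    return [x - shift - log_sum for x in logits]


def cross_entropy(logits_per_token, target_ids):
    """Mean negative log-probability assigned to the correct next tokens."""
    losses = [-log_softmax(logits)[target]
              for logits, target in zip(logits_per_token, target_ids)]
    return sum(losses) / len(losses)


def perplexity(loss):
    """Perplexity is the exponential of the cross-entropy loss."""
    return math.exp(loss)


# A model that is undecided between 4 tokens has perplexity 4.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
```

Note that exp(0.8426) is about 2.32, slightly below the reported average perplexity of 2.4008. That would be consistent with perplexity having been averaged per batch rather than computed from the average loss, although the slides do not say so.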
29. Original caption: Everton's Wayne Rooney applauds the
fans after the match
________________________
Face Recognition
• 'Name': 'Wayne Rooney', 'Description': NA (Celeb API)
Predicted caption (seq2seq):
• team player applauds fans after the match
_________________________
Combined Caption
Wayne Rooney applauds fans after the match
30. Original caption:
• Leicester City's Jamie Vardy during the warm up before the match
___________________________________
Facial Recognition found 4 persons
• 'Name': 'Jamie Vardy', 'Description': 'EPL player on Leicester City'
• 'Name': 'Wilfred Ndidi', 'Description': 'EPL player on Leicester City'
• 'Name': 'Ben Chilwell', 'Description': 'EPL player on Leicester City'
• 'Name': 'Dennis Praet', 'Description': 'EPL player on Leicester City'
Predicted caption
• team player and player during the warm up before the match
______________________________
Combined predicted caption:
Leicester City's Jamie Vardy, Wilfred Ndidi, Ben Chilwell and
Dennis Praet during the warm up before the match
31. Original caption:
• Chelsea's Mateo Kovacic and Antonio Rudiger arrive at
the stadium before the match
____________________________________
Face recognition recognized 2 persons:
• Chelsea’s Mateo Kovacic
• Chelsea’s Antonio Rudiger
Predicted caption: team player arrives at the stadium
before the match
_________________________________________
Combined caption:
• Chelsea’s Mateo Kovacic and Antonio Rudiger arrive at
the stadium before the match
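The combination step in the preceding examples can be sketched as filling the generic subject of the predicted caption with the recognised names. This is a simplified illustration, not the presenter's implementation: the function names and the rule for reading the team from the 'Description' field are assumptions, and verb agreement (for instance "arrives" versus "arrive") is not handled.

```python
import re

# Generic subject emitted by the captioning model, e.g. "team player" or "team player and player".
SUBJECT_PATTERN = re.compile(r"^team player(?: and player)*")


def team_of(face):
    """Read the team from a description such as 'EPL player on Leicester City' (None if absent)."""
    description = face.get("Description") or ""
    marker = " on "
    return description.split(marker, 1)[1] if marker in description else None


def join_names(names):
    """Join names in editorial style: 'A', 'A and B', 'A, B and C'."""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def combine_caption(predicted, faces):
    """Replace the generic subject of the predicted caption with recognised names."""
    names = [face["Name"] for face in faces]
    if not names or not SUBJECT_PATTERN.match(predicted):
        return predicted  # nothing to fill in; leave the caption for an editor
    subject = join_names(names)
    teams = {team_of(face) for face in faces}
    if len(teams) == 1 and None not in teams:
        # All recognised people share one known team, so prefix it once.
        subject = f"{teams.pop()}'s {subject}"
    return SUBJECT_PATTERN.sub(lambda _: subject, predicted, count=1)


rooney = combine_caption(
    "team player applauds fans after the match",
    [{"Name": "Wayne Rooney", "Description": ""}],
)
```

Here `rooney` reproduces the combined caption from the Wayne Rooney example. The team prefix is added only when every recognised face carries the same team in its description.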
32. Challenges in Combining Facial Recognition & Auto-Captioning
Actual Caption:
Brighton & Hove Albion's Leandro Trossard
scores their first goal
Seq2Seq Predicted Caption:
[team]’s [player] scores their [number] goal
Challenges:
• Multiple players identified, but only one
player slot in proposed caption
• Scoring player not identified
33. Conclusion
• Examples of descriptive information about
images
• Use case of facial recognition and image
captioning applied to English football Premier
League
• AWS framework for custom-trained facial
recognition and pipeline testing
• Custom-trained sequence-to-sequence
model with attention
• Results of combination of both models
34. Useful information
• Learn about AWS Rekognition here
https://aws.amazon.com/rekognition/
• A great tutorial on seq2seq models for captioning by Artem Makarov
https://medium.com/analytics-vidhya/image-captioning-with-attention-part-1-e8a5f783f6d3
• “Show, Attend and Tell” paper about the attention mechanism
https://arxiv.org/pdf/1502.03044.pdf
• ROUGE metric
https://en.wikipedia.org/wiki/ROUGE_(metric)