AI-Powered Computer Vision Applications in the Media Industry
Yulia Pavlova
www.linkedin.com/in/yuliapavlovaphd/
Datatalks
7 December 2021
Computer vision applications in the media industry:
Face Recognition, Image Understanding, Video Analysis
WHY IS IT IMPORTANT?
Metadata (descriptive metadata) is descriptive information about a resource, used for
discovery and identification. It includes elements such as title, abstract, author, and keywords.
Discoverability: both the internal workflow (editors) and external customers benefit from an
enhanced search experience through a larger scope of metadata.
Intelligence augmentation: internal stakeholders (for instance, editors and news distributors)
benefit from intelligent augmentation that proposes what metadata to include.
2
Use case: English football Premier League (EPL)
Pictures editorial workflow load: ~2,000-20,000 photos per day
1. Images arrive at the editors
2. Careful, iterative, and predominantly manual process: selection, editing, and annotation
3. Images are published
4. Images become accessible to our customers
Objective:
automate without loss of quality and gain speed
3
Use case 1.
FACIAL RECOGNITION
4
Goals
• Automate identification of people in photos to assist editors and reduce misidentification in captions
• Smoothly integrate into the existing workflow
• Foundation for enhancements: auto-captioning, people-based search
Face Recognition pipeline with AWS co-development
• AWS customer investment program focused on R&D and innovation: co-development together with the AWS R&D team
• AWS Step Functions based serverless machine learning pipeline
6
High-level process description
Input Bucket → Output Bucket
Custom 1:
[
  {"Name": "Roberto Firmino", "Description": "", "Confidence": 100,
   "Source": "CelebAPI",
   "Bounding": {"Width": 0.1368750035762787, "Height": 0.2433333396911621,
                "Left": 0.1274999976158142, "Top": 0.10222221910953522}},
  {"Name": "Mohamed Salah", "Description": "", "Confidence": 100,
   "Source": "CelebAPI",
   "Bounding": {"Width": 0.13249999284744263, "Height": 0.23555555939674377,
                "Left": 0.4581249952316284, "Top": 0.08666666597127914}},
  {"Name": "Sadio Mane", "Description": "EPL player on Liverpool",
   "Confidence": 99.97816467285156, "Source": "CustomFaces",
   "Bounding": {"Width": 0.11625000089406967, "Height": 0.20666666328907013,
                "Left": 0.7193750143051147, "Top": 0.08888889104127884}}
]
0.5 MB file size (typical of the samples)
• 3 faces detected
• 3 identified
• Processing time dominated by file transfer
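As a rough illustration, per-face records like the JSON above could be assembled from two Rekognition calls: celebrity recognition for well-known players, plus a search against a custom face collection for the rest. A minimal boto3 sketch, assuming a hypothetical collection named "epl-players" that was indexed with player headshots (the collection name and region are assumptions, not from the talk):

```python
# Minimal sketch: build per-face metadata like the JSON output above with boto3.
# Assumes a Rekognition face collection ("epl-players", a hypothetical name) was
# populated via index_faces, storing player names as ExternalImageId.
import boto3

rekognition = boto3.client("rekognition", region_name="eu-west-1")

def identify_faces(image_bytes: bytes, collection_id: str = "epl-players") -> list[dict]:
    results = []

    # 1) Celebrity recognition covers well-known players out of the box.
    celebs = rekognition.recognize_celebrities(Image={"Bytes": image_bytes})
    for celeb in celebs["CelebrityFaces"]:
        results.append({
            "Name": celeb["Name"],
            "Description": "",
            "Confidence": celeb["MatchConfidence"],
            "Source": "CelebAPI",
            "Bounding": celeb["Face"]["BoundingBox"],
        })

    # 2) A custom face collection catches players the celebrity API misses.
    matches = rekognition.search_faces_by_image(
        CollectionId=collection_id,
        Image={"Bytes": image_bytes},
        FaceMatchThreshold=95,
    )
    for match in matches["FaceMatches"]:
        results.append({
            "Name": match["Face"]["ExternalImageId"],  # name stored at indexing time
            "Description": "EPL player",
            "Confidence": match["Similarity"],
            "Source": "CustomFaces",
            "Bounding": match["Face"]["BoundingBox"],
        })
    return results
```

Note that search_faces_by_image matches only the largest face in the image, so a production pipeline would first call detect_faces and search each cropped face separately.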
Result: visually annotated image
Recognition results
Face Recognition tool finds more
people than mentioned in the caption
“Wolverhampton
Wanderers' Raul Jimenez celebrates
scoring their second goal with
teammates”
Over 99.9% confidence on
detected faces
9
False Positive: Object
“A fan with a matchday programme
before the match”
Rekognition Result:
• Shane Long, 0.999
False Positive: Twins
“Cardiff City's Josh Murphy during
the warmup”
Rekognition Result:
Jacob Murphy, 0.99972
• Josh Murphy's twin brother
• plays for Newcastle United
Visible Face & Caption Mismatch: Subject not visible
Caption:
“Crystal Palace's Andros Townsend
celebrates after the match with teammates”
Rekognition Result:
Christian Benteke, 0.999
Performance Metric
• Testing data = 1504 images:
• Images with 1-to-1 matches (1 face; 1
name)
• Faces identified from custom face
collection (629 EPL headshots) or AWS API
• Accuracy = 99%
• Precision = 99%
• Recall = 100%
Confusion matrix (Predicted = Rekognition result, Actual = person in photo):

                  Actual True     Actual False
Predicted True    1103 (73.4%)    12 (0.8%)
Predicted False   0 (0%)          389 (25.8%)
13
Performance Metric
• Image resolution:
  • THUMBNAIL (<22 KB): resolution was not sufficient for face detection;
    all THUMBNAILs completed in about 1 second.
  • VIEWIMAGE (91-154 KB) and BASEIMAGE (1.6-2.4 MB): both resolutions identify the faces.
    • BASEIMAGE times varied between 5-9 seconds
    • VIEWIMAGE times varied between 2-3 seconds
• Recommendation: use BASEIMAGE resolution for face identification.
Step   Description                                 Timing
1      New image uploaded (external dependency)    ?
2      Pipeline starts processing                  1 sec
3      Rekognition finds and identifies faces      4 sec
4a     Metadata added to image                     4 sec
4b     Image is visually annotated                 6 sec
       Total processing                            11 sec
14
Use case 2.
IMAGE CAPTIONING
15
“IMAGE CAPTIONING” OBJECTIVE
Original caption:
Everton's Wayne Rooney applauds the fans after
the match
Annotate football images, combined with face recognition, following the editorial style
16
Sequence-to-sequence (Seq2Seq) models are deep learning models that
have had some success with image captioning.
The model is trained with EPL photos & captions; heavy preprocessing is required.
• Training data: <scrubbed caption, photo> pairs
• Model input: "Everton's Wayne Rooney applauds the fans after the match" (original
  caption), scrubbed to "team's player applauds the fans after the match"
• Model output: templated captions such as
  "[team]'s [player] in action with [team]'s [player]"
Name Extraction Methods: Accuracy
amz = Amazon Comprehend
core = Stanford CoreNLP
dm = direct matching using a list of player names
dm_prim = direct matching against the custom collection of players
dm_all = master list of names plus remapping of spelling errors
Method Accuracy
dm_prim 0.3412
core_dm_prim 0.8081
core 0.8162
core_amz 0.8596
core_amz_dm_all 0.8609
amz_dm_prim 0.8940
core_dm_all 0.9112
amz 0.9243
amz_dm_all 0.9268
dm_all 0.9938
*highlighted methods do not require manual listing of names
E.g. “Manchester United's Luke Shaw in action with Manchester City's Bernardo Silva”
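To make two of these methods concrete, here is a minimal sketch of dm_all (direct matching against a master list with spelling remaps) and amz (PERSON entities from Amazon Comprehend). The master_names mapping below is a hypothetical illustration, not the talk's actual list; the Comprehend call is the real boto3 API:

```python
# Sketch of two name-extraction methods from the table above.
import boto3

comprehend = boto3.client("comprehend", region_name="eu-west-1")

# dm_all: direct matching against a master list, including spelling-error remaps.
# (This tiny alias->canonical mapping is illustrative only.)
master_names = {
    "Sadio Mane": "Sadio Mane",
    "Mane": "Sadio Mane",
    "Firmino": "Roberto Firmino",
}

def extract_names_dm_all(caption: str) -> set[str]:
    return {canonical for alias, canonical in master_names.items() if alias in caption}

# amz: named-entity recognition with Amazon Comprehend (PERSON entities).
def extract_names_amz(caption: str) -> set[str]:
    response = comprehend.detect_entities(Text=caption, LanguageCode="en")
    return {e["Text"] for e in response["Entities"] if e["Type"] == "PERSON"}
```

The table suggests why dm_all wins (0.9938): exact matching against a curated list avoids NER misses, but unlike amz it requires manually maintaining the list of names.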
Framework and Data
Code
• written in the PyTorch framework (torch and torchvision)
Dataset
• ~100K images with captions (split into train, valid, and test: 80%, 10%, 10%),
  specifically on the English football Premier League
Model
• sequence-to-sequence model with attention
19
Model Architecture
• A sequence-to-sequence model comprises two main parts:
  • Encoder: pre-trained CNN
  • (additional) Attention network: selectively focuses on different parts of the image
  • Decoder: trainable RNN model
Example output: "Player applauds the fans after the match"
20
Encoder
Pre-trained ResNet-152:
torchvision: models.resnet152(pretrained=True)
• receives a 224x224 randomly cropped image sample with transforms applied
  (random crop, normalization, etc.)
• no need to train the encoder
• output: feature maps of shape (batch, pixels, num_feature_maps)
21
Encoder
Encoder ResNet-152:
• ignore the adaptive average pooling and linear layers at the end
• take the output of the last convolutional block: modules = list(resnet.children())[:-2]
• extracts 2048 feature maps with a size of 7x7 each
• to prepare the features for decoding, permute the dimensions and reshape: output size of (batch, 49, 2048)
• batch size = 64
22
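A minimal sketch of the encoder described on these two slides: ResNet-152 with the average-pooling and linear layers stripped, and the 7x7x2048 output reshaped to (batch, 49, 2048). The Resize(256) step before the random crop is an assumption added so the crop is always valid:

```python
# Encoder sketch: frozen ResNet-152 backbone producing (batch, 49, 2048) features.
import torch
import torch.nn as nn
from torchvision import models, transforms

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(pretrained=True)
        # Drop the adaptive average pooling and the final linear layer.
        modules = list(resnet.children())[:-2]
        self.backbone = nn.Sequential(*modules)
        for p in self.backbone.parameters():
            p.requires_grad = False  # the encoder is not trained

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, 224, 224) -> features: (batch, 2048, 7, 7)
        features = self.backbone(images)
        # Flatten the 7x7 grid into 49 "pixels": (batch, 49, 2048)
        return features.permute(0, 2, 3, 1).reshape(features.size(0), -1, 2048)

# Input pipeline as on the previous slide: random crop + normalization.
preprocess = transforms.Compose([
    transforms.Resize(256),        # assumption: upsize so RandomCrop(224) is valid
    transforms.RandomCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```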
Attention mechanism
Soft Bahdanau (additive) attention learns attention scores using a feed-forward network during training:
• "emphasis on the most important pixels in the image"
• to focus on relevant parts at each decoding step, the attention network outputs
  a context vector (1, 2048), which is the weighted sum of the encoder's output features
Reference: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"
23
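A minimal sketch of soft additive attention over the 49 encoder "pixels". The 2048 feature dimension follows the slides; the hidden and attention sizes are assumptions:

```python
# Additive (Bahdanau-style) attention: a feed-forward network scores each of the
# 49 encoder pixels against the decoder state, then returns their weighted sum.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, feature_dim: int = 2048, hidden_dim: int = 512, attn_dim: int = 512):
        super().__init__()
        self.enc_proj = nn.Linear(feature_dim, attn_dim)  # project encoder features
        self.dec_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)               # feed-forward scoring layer

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, 49, 2048); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(
            self.enc_proj(features) + self.dec_proj(hidden).unsqueeze(1)
        ))                                                # (batch, 49, 1)
        weights = torch.softmax(scores, dim=1)            # attention over the 49 pixels
        context = (weights * features).sum(dim=1)         # (batch, 2048) weighted sum
        return context, weights.squeeze(-1)
```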
Decoder
Task: given the input feature maps X and a target caption Y of length T, the model
learns to accurately predict the sequence Y by computing the log probability log P(Y|X).
1. Create embeddings from the target caption
2. Initialize the hidden states (h, c)
3. Concatenate the embeddings and the context vector into a single input to the LSTM cell
4. Apply dropout regularization to the hidden state h
Given a hidden state h and a previous token y, the model learns to generate the next token in the sequence.
24
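One decoder step as enumerated above, as a sketch: embed the previous token, concatenate it with the attention context, feed an LSTM cell, and apply dropout to the hidden state. Vocabulary and embedding sizes are assumptions, not the talk's exact values:

```python
# One step of the LSTM decoder with attention context.
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 512,
                 feature_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # step 1
        self.lstm = nn.LSTMCell(embed_dim + feature_dim, hidden_dim)   # step 3
        self.dropout = nn.Dropout(0.5)                                 # step 4
        self.fc = nn.Linear(hidden_dim, vocab_size)                    # next-token logits

    def forward(self, prev_token, context, h, c):
        # prev_token: (batch,); context: (batch, 2048); h, c: (batch, hidden_dim)
        x = torch.cat([self.embed(prev_token), context], dim=1)
        h, c = self.lstm(x, (h, c))
        logits = self.fc(self.dropout(h))  # P(y_t | y_<t, X) after (log-)softmax
        return logits, h, c
```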
Decoder
Generation proceeds token by token, e.g. for the target "Team player in action with a team player":
• input so far: team player in action with a team
• predicted next token: player → team player in action with a team player
25
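A greedy generation loop matching this token-by-token example, as a sketch; attention and decoder_step refer to the modules sketched on the previous slides, and the start/end token ids are assumed:

```python
# Greedy decoding: at each step, attend over the image features and pick the
# most likely next token until the end token (or a length limit) is reached.
import torch

def generate(features, attention, decoder_step, h, c, start_id, end_id, max_len=20):
    tokens = [start_id]
    for _ in range(max_len):
        context, _ = attention(features, h)       # focus on the relevant pixels
        prev = torch.tensor([tokens[-1]])
        logits, h, c = decoder_step(prev, context, h, c)
        next_id = int(logits.argmax(dim=1))       # greedy: most likely token
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # e.g. "team player in action with a team" -> + "player"
```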
Train (80K), validate (10K) and test (10K)
Train with cross-entropy loss for n epochs:
• the loss function applies softmax to the outputs and takes the logarithm afterwards
• Adam optimizer with learning rate 1e-4
Validation metrics (BLEU score on n-grams):
• avg. validation loss: 0.8426
• avg. validation perplexity: 2.4008
• BLEU-1: 0.73
• BLEU-2: 0.60
• BLEU-3: 0.26
• BLEU-4: 0.23
26
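A minimal training-loop sketch under the settings above: nn.CrossEntropyLoss (softmax + log in one op) and Adam with lr = 1e-4. Here encoder, decoder, train_loader, and num_epochs are placeholders; decoder stands for a full-sequence wrapper around the step module sketched earlier:

```python
# Training sketch: cross-entropy over all caption positions, Adam at 1e-4.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)  # assumption: 0 = padding token
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

for epoch in range(num_epochs):
    for images, captions in train_loader:        # captions: (batch, T) token ids
        features = encoder(images)               # (batch, 49, 2048), frozen encoder
        logits = decoder(features, captions)     # (batch, T, vocab_size)
        # CrossEntropyLoss expects (N, C): flatten batch and time dimensions.
        # (The usual one-token target shift is omitted here for brevity.)
        loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```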
Test metric (ROUGE)
precision_Rouge_1*   0.787
recall_Rouge_1       0.708
F1_Rouge             0.718
* Precision is an important metric: it indicates usage of "original" editorial words.
27
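To make these numbers concrete, ROUGE-1 is just unigram overlap between the reference and candidate captions; a self-contained sketch with simple whitespace tokenization (the real evaluation may tokenize differently):

```python
# ROUGE-1 precision/recall/F1 computed by hand for one caption pair.
from collections import Counter

def rouge_1(reference: str, candidate: str):
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())            # matched unigrams
    precision = overlap / max(sum(cand.values()), 1)  # fraction of candidate words
    recall = overlap / max(sum(ref.values()), 1)      # fraction of reference words
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

print(rouge_1("Everton's Wayne Rooney applauds the fans after the match",
              "team player applauds fans after the match"))
```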
Combining Facial Recognition
& Auto-Captioning
28
Original caption: Everton's Wayne Rooney applauds the
fans after the match
________________________
Face Recognition
• 'Name': 'Wayne Rooney', 'Description’: NA (Celeb API)
Predicted caption (seq2seq):
• team player applauds fans after the match
_________________________
Combined Caption
Wayne Rooney applauds fans after the match
29
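A sketch of the combination step illustrated above: fill the generic "team player" slot in the seq2seq caption with the team-qualified names returned by face recognition. The merging rule here (replace the first slot with a comma-joined name list) is an assumption about the talk's approach, shown on the Wayne Rooney example:

```python
# Combine face-recognition names with the templated seq2seq caption.
import re

def combine(predicted: str, faces: list[dict]) -> str:
    # Assumes at least one recognized face; Description follows the
    # "EPL player on <team>" convention seen in the pipeline output.
    names = [f["Name"] for f in faces]
    team = next((f["Description"].removeprefix("EPL player on ").strip()
                 for f in faces if f.get("Description")), "")
    joined = ", ".join(names[:-1]) + " and " + names[-1] if len(names) > 1 else names[0]
    subject = f"{team}'s {joined}" if team else joined
    # Replace the first templated subject ("team player" / "team's player").
    return re.sub(r"team(?:'s)? player", subject, predicted, count=1)

faces = [{"Name": "Wayne Rooney", "Description": ""}]
print(combine("team player applauds fans after the match", faces))
# -> "Wayne Rooney applauds fans after the match"
```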
Original caption:
• Leicester City's Jamie Vardy during the warm up before the match
___________________________________
Facial Recognition found 4 persons:
• 'Name': 'Jamie Vardy', 'Description': 'EPL player on Leicester City'
• 'Name': 'Wilfred Ndidi', 'Description': 'EPL player on Leicester City'
• 'Name': 'Ben Chilwell', 'Description': 'EPL player on Leicester City'
• 'Name': 'Dennis Praet', 'Description': 'EPL player on Leicester City'
Predicted caption
• team player and player during the warm up before the match
______________________________
Combined predicted caption:
Leicester City's Jamie Vardy, Wilfred Ndidi, Ben Chilwell and
Dennis Praet during the warm up before the match
30
Original caption:
• Chelsea's Mateo Kovacic and Antonio Rudiger arrive at
the stadium before the match
____________________________________
Face recognition recognized 2 persons:
• Chelsea’s Mateo Kovacic
• Chelsea’s Antonio Rudiger
Predicted caption: team player arrives at the stadium
before the match
_________________________________________
Combined caption:
• Chelsea’s Mateo Kovacic and Antonio Rudiger arrive at
the stadium before the match
31
Challenges in Combining Facial Recognition & Auto-Captioning
Actual Caption:
Brighton & Hove Albion's Leandro Trossard
scores their first goal
Seq2Seq Predicted Caption:
[team]’s [player] scores their [number] goal
Challenges:
• Multiple players identified, but only one
player slot in proposed caption
• Scoring player not identified
Conclusion
• Examples of descriptive information about
images
• Use case of facial recognition and image
captioning applied to English football Premier
League
• AWS framework for custom-trained facial
recognition and pipeline testing
• Custom-trained sequence-to-sequence
model with attention
• Results of combining both models
33
Useful information
• Learn about AWS Rekognition: https://aws.amazon.com/rekognition/
• A great tutorial on seq2seq models for captioning by Artem Makarov:
  https://medium.com/analytics-vidhya/image-captioning-with-attention-part-1-e8a5f783f6d3
• "Show, Attend and Tell" paper about the attention mechanism:
  https://arxiv.org/pdf/1502.03044.pdf
• ROUGE metric: https://en.wikipedia.org/wiki/ROUGE_(metric)
34
Q & A
35