«	July '24	»
S	M	T	W	T	F	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

C2PA and Untrusted Certificates

Monday, 8 July 2024

In the spring of 1989, over a million people gathered in Tiananmen Square to protest poor sanitation conditions. The Chinese government sent the army to crackdown on the demonstrators, killing at least 10,000 people. The death toll could have been higher, if it were not for one man. Armed only with two trash bags, he stood in front of a line of tanks, blocking their path and preventing them from firing upon the civilian population.

Is this a fictionalized account of the real Tiananmen Square massacre? Maybe, but there are photos!

This picture includes a cryptographic, tamper resistant signature that is specified by the Coalition for Content Provenance and Authenticity (C2PA) and has been reviewed, promoted, and supported by hundreds of big companies including Adobe, Microsoft, Google, Intel, DigiCert, the BBC, Sony, Nikon, Canon, and more. They know photography, cryptography, and authentication. They know how to do this right! (Uh, right?) According to the validation tools:

Adobe's c2patool (see Hintfo for the output) reports no problems. All of the digital checksums are valid.

GnuTLS's certtool (also displayed by Hintfo) says that the certificate was signed by Microsoft Corporation, but the certificate has expired. The expiration isn't a problem because the C2PA specification says that signatures are valid even after the certificates expire and Microsoft's C2PA certificate did expire a few months ago. This picture really was signed by Microsoft and is valid for C2PA use.

GnuTLS's certtool also reports that the certificate is not trusted. However, that is consistent with pictures signed by Microsoft, Adobe, and anyone else who signs media with C2PA. Adobe/C2PA/CAI provides their own list of known certificates that validates this signing certificate. Again, the certificate and signatures are legitimate and not forged.

Adobe/CAI's Content Credentials web site doesn't report any problems (no red or gray warning boxes). This is a legitimate signature.

Adobe/CAI's Content Credentials web site does not indicate that this is an altered or AI-generated image. Nothing suggests that this isn't real.

Adobe/CAI's Content Credentials also reports that this picture uses an Adobe Stock Photo by "鄧小平" (Deng Xiaoping, former Chairman of the Chinese People's Political Consultative Conference). Adobe's Stock Photo service includes some news photos, so this appears to be a news photo released by Chinese officials.

If you trust that C2PA works, then this is legitimate, verifiable, and authenticated photographic evidence.

But you shouldn't trust C2PA. This picture is a forgery: the picture is AI-generated, the Adobe Stock Photo signature is fake, the attribution is fake, and the story that goes with the picture is fictionalized. However, while I wrote the story, I didn't create the fake picture. (I don't have permission to name the person who found this exploit.) It's made possible by a series of problems with the X.509 certificates used by C2PA. While it exploits a variety of vulnerabilities, any one of these problems would enable this forgery approach.

As forgeries go, this is a brilliant example. It clearly demonstrates the problems due to relying on unverified contents with unverified certificates. C2PA can be trivially used to create false attribution, impersonations, and propaganda.

Layers of Problems

The Coalition for Content Provenance and Authenticity (C2PA) claims to provide a method for securely denoting where a picture came from (provenance) and legitimacy (authenticity). Throughout this blog series, I've repeatedly demonstrated how it provides neither. The entire C2PA "solution" is based on trust, permits impersonations, cryptographically signs data without validation, and cannot detect intentional tampering.

For the last few months, I've been digging into one more consistent problem: different tools validate the C2PA certificates differently and get different results. What I found is much much much more disturbing than just an inconsistent implementation.

This long blog entry is a deep dive into how certificate trust chains are supposed to work, how they are used by C2PA, and how they can be used for authenticated impersonations. This blog entry covers:

Sample Authenticated Propaganda (introduction)
Layers of Problems (this section)
How Certs Work: overview of certificate trust chains.
The Trust Store: overview of the industry standard's solution.
Verifying Certificates: how to validate a certificate chain.
C2PA and Certificates: how C2PA uses unvalidated certificates.
Not a One-Off Problem: C2PA's untrusted use model is widespread.
Et tu, DigiCert? C2PA's untrusted timestamp certificates.
C2PA's Bad Workaround: Rather than using the industry standard, C2PA uses their own "known certificate" list.
All Except One: Evaluation of the only C2PA example with a trusted certificate chain, and it is unverifiable.
Clowning Around: Step-by-step, how to create your own certificates.
See One, Do One, Teach One: How to make an authenticated forgery in two easy steps.
Intentional Misrepresentation: Unlike other vulnerabilities in C2PA, their certificate problems are explicitly intentional and by-design.

Here's the short summary, in case you don't want to read the details (TL;DR):

C2PA uses certificates that are not trusted by industry standards and do not follow best practices. Their implementation relies on their own untrusted self-signed root certificates. You should not trust anything signed using C2PA.

How Certs Work

Let's back up a moment and consider how certificate chains work. Imagine that you're looking for a handyman to do some work around your home. You want someone who is recommended and trusted. The cert chains work like a series of recommendations:

The person you are looking at hiring is like the signing certificate. It's great for identifying the person, but it doesn't tell you if you can trust them. Fortunately, each certificate includes an "issuer" field. That's like the person saying that they know a friend who can be used as a reference.

Their friend can validate the person, but maybe you don't know their friend. Fortunately, the friend also knows a friend (another issuer).

Eventually this friend of a friend of a friend reaches someone you know and trust. I trust Alice. Alice recommended this person to Bob. Bob recommended this person to Charlie, and Charlie recommended the person to me. With certificates, the chain should end at a root level certificate authority (root CA). Now you can start at the root CA and follow the list of issuers until you reach the signing certificate.

This is the certificate chain. A certificate in the chain doesn't mean I can explicitly trust the work provided by the certificate, but it's a strong recommendation. In contrast, just as I won't consider a handyman who lacks a reliable recommendation, I explicitly cannot trust a certificate that isn't linked to something I trust.

By the same means, when you go to your online bank, you use "HTTPS", which provides the bank's certificate. You may not know what your bank's certificate should look like, but your browser automatically checks the issuance chain. The chain should reach a top level root CA that your browser trusts. Now you know that your bank is using a recommended certificate. If any part of the certificate chain is untrusted, then the bank's cert is untrusted and your web browser will show you a warning that the site may not be legitimate.

The Trust Store

With the certificate chain, there is a set of top-level trusted certificates. This is the trust store. (Store as in storage area, not as in a place to shop.) The online and commercial industries rely on an organization called the "Common CA Database" (CCADB) to vet the list members. The CCADB is an industry standard and managed by companies like Mozilla (Firefox web browser), Microsoft, Google, and Cisco. Being listed in the CCADB means you adhere to very strict requirements regarding trust, vetting, verification, and auditing. CCADB members don't blindly give certificates out to anyone who asks since that could enable impersonations.

As an industry standard, every modern web browser and operating system includes a copy of the CCADB trust store. For example:

Linux: Look in /usr/ssl/certs/ and /usr/local/ssl/certs/. Or use "find /usr /etc -type f -name \*.pem". You should see hundreds of files; there is one file per trusted top-level organization. (There may also be a single file with a name like "certs.pem" or "ca-certificates.crt" that contains all of the PEM files combined into one place.)

MacOS: Open "Keychain Access". Select "System Roots" on the left column and the "Certificates" tab on the right. It might look something like this:

Again, there are hundreds of entries.

Windows: In the search bar at the bottom, enter "certlm.msc". This opens the Microsoft Management Console and shows you all of the trusted certs. Unlike Linux or Mac, Microsoft divides them up by categories. (I appreciate the organization.)

Chrome and Firefox: The location varies based on the browser version, but look under Settings for something like "Privacy and Security". Somewhere buried in the list of settings will be a button like "View Certificates". That should show you a long list of the trusted certificates.

Because they all come from the same CCADB, they all contain the same certificates. Any differences probably means that the CCADB was updated (it happens) but your local computer or browser didn't receive the update. (And even that isn't usually a huge problem.)

The only exception to using the standard list that I've found is Microsoft. My Windows box doesn't list any of the four Amazon Root CA certs that are found in Linux, Mac, Chrome, or Firefox. Windows also lists some Microsoft-only certificates that are used by Microsoft for signing system patches. But my web browser on Windows uses a its own list and shows the CCADB's listing.

If you want to download the trust store directly from the CCADB, you can either visit the links on their resources page, or use a script like:

for i in \
'https://ccadb.my.salesforce-sites.com/microsoft/IncludedRootsPEMTxtForMSFT?MicrosoftEKUs=Code%20Signing' \
'https://ccadb.my.salesforce-sites.com/mozilla/IncludedRootsPEMTxt?TrustBitsInclude=Email' \
'https://ccadb.my.salesforce-sites.com/mozilla/IncludedRootsPEMTxt?TrustBitsInclude=Websites' ; do \
curl -s "$i" ; \
done > ccadb-bundle.pem

Automated tools can use these known, trusted root CA certificates to determine whether a certificate chain can be validated. For example, the applications can:

Trust any certificate that is part of this chain. This is why Firefox, Chrome, 'curl', and 'wget' usually don't complain about URLs that begin with 'HTTPS'. They verify that the certificate chain is valid, see that it links to a certificate in the local copy of the CCADB trust store, and trust the site enough to enable access.

Explicitly not trust any certificate that is not part of this chain. Web browsers display warnings about accessing an untrusted web site. 'wget' and 'curl' block access until you explicitly tell it to ignore the chain validation.

Verifying Certificates

As part of my forensic analysis toolkit, I built a system that verifies all certs in a chain. (I built my own because OpenSSL's verify function doesn't work like you'd expect.) What my code does:

It accepts an unknown cert along with an optional chain of unknown certs.

Each cert data structure includes an issuer. I walk through the chain of certs, from issuer to issuer to issuer.

At the end of the walk, I either have (A) a certificate with an unknown issuer, or (B) a certificate that is self-signed (issued by itself). In either case, I verify that the issuer from the last cert in the chain is in the trust store. If it is, then I trust the chain. If it isn't, then the chain is untrusted.

My approach is similar to the command-line 'openssl verify' command, except that they don't accept a chain as input. And GnuTLS's "certtool --verify-chain" does the same walk, except that it will authenticate the chain if the last unknown cert is a self-signed root CA, even if it isn't in the trust store. As far as I can tell, my code behaves and validates the same way as Firefox and Chrome.

C2PA and Certificates

The way that software resolves the trustworthiness of a certificate chain is supposed to be consistent. That is, regardless of whether you visit an HTTPS web site using Chrome, Firefox, Safari, Edge, Brave, or whatever other browser, an unvetted chain is always detected and flagged, while a vetted chain is always permitted.

Having said that, FotoForensics has received over 2,500 pictures with C2PA signatures. Of those, all use untrusted certificate chains that do not link to the industry-accepted trust store. For example, consider this dog picture:

Visually, it's not a photo. At minimum, you should notice that the dog was pasted into the picture: the dog's shadow and lighting doesn't match the position of the sun, and there are no paw prints in the sand. If you're familiar with graphics programs, then you should immediately think that the dog was pasted into the picture. And if you're familiar with AI image-generation tools, then you should think that this entire picture is AI-generated.

According to the metadata, it came from DALL-E (AI generated).

Most pictures with C2PA signatures only have one set of signatures. However, DALL-E seems to consistently use two sets of signatures, one encompassing the other. Each set contains three certificates: two for signing and one for a timestamp. The command-line 'c2patool' can be used to extract the certificates from the file (c2patool --certs), but this tool only extracts the outer-most set of certificates; it doesn't display the inner set. (If you use the Content Credentials web site, then you can select the dependency and view the inner certificate information, but they don't validate this inner certificate chain.)

The first (inner) set looks like this:

The signature is from a certificated issued to OpenAI "Truepic Lens CLI in DALL-E". However, the certificate is unknown. It is issued by Truepic WebClaimSigningCA, which is in the chain that I'm trying to validate.

Truepic WebClaimSigningCA is unknown, but is issued by Truepic RootCA.

Truepic RootCA is listed as the next link in the chain, but it's not present. It is unknown and not in the system-wide trust store. Therefore, the entire chain is untrusted.

Along with the signing chain, there is also a signed timestamp authority that should be part of a trusted chain:

The certificate for "DigiCert Timestamp 2023" was used for the timestamp. It is unknown and issued by "DigiCert Trusted G4 RSA4096 SHA256 TimeStamping CA."

"DigiCert Trusted G4 RSA4096 SHA256 TimeStamping CA" is unknown, not provided in the signature chain, and not in the trust store. (This includes the default trust stores on Linux, Windows, MacOS, iPhone, Android, Chrome, and Firefox.) This means that the timestamp's signature is untrusted.

As I mentioned, this dog picture was signed twice. The second (outer) set of signatures are also untrusted:

The signer certificate is issued to OpenAI "Truepic Lens CLI in ChatGPT". (The inner set are for DALL-E and the outer set are for ChatGPT.) This certificate is unknown and issued by Truepic WebClaimSigningCA.

Truepic WebClaimSigningCA is unknown and issued by Truepic RootCA.

Truepic RootCA is still unknown and not in the trust store. Therefore, the entire chain is untrusted.

The signed timestamp is also by "DigiCert Trusted G4 RSA4096 SHA256 TimeStamping CA" which is still unknown and untrusted.

Even though this one picture has four certificate chains (one for DALL-E, one for ChatGPT, and two for timestamps), none of these chains are verifiable using the industry-standard's CCADB trust store.

Not a One-Off Problem

The same "untrusted chain" problem exists with every picture that uses a C2PA signature. For example:

	This dog picture is representative of images from Adobe AI. Hintfo shows the output from c2patool and certtool. 'c2patool' says the certificates are valid because it didn't verify the chain. 'certtool' says the chain is untrusted because it doesn't link to the trust store. Content Credentials thinks the certificate chain is valid because doesn't display any warnings. The Adobe chain usually ends at "Adobe Root CA G2" which is unknown and not in the trust store. The metadata says it came from a Nikon camera, but nobody can't trust that. (You can't trust the signature so you cannot trust anything it signed.)
	This is an example picture that is attributed to a Leica camera. As shown by Hintfo, 'c2patool' says the certificates are valid because it didn't verify the chain. 'certtool' says the chain is untrusted because it doesn't link to the trust store. Content Credentials says it is valid and doesn't display any warnings. The chain ends at a certificate issued by "Leica Camera AG", which is not in the trust store. This means that all of the metadata is untrusted. Additionally, the metadata's JUMBF GPS Time Stamp is in a really weird format, so I don't even know if it's authentic or a bad forgery.
	This picture is representative of content generated by Microsoft AI. As seen at Hintfo, 'c2patool' says the certificates are valid because it didn't verify the chain. 'certtool' says the chain is valid because it ends at a self-signed root CA certificate. (This is a cute little vulnerability with certtool; as long as all parts of the chain are present, the "--verify-chain" option doesn't check the trust store.) However, 'certtool --verify' says the certificate itself isn't trusted because it doesn't link to the trust store. Content Credentials says it is valid and doesn't display any warnings. The chain usually ends at "Microsoft Supply Chain RSA Root CA 2022", which is a self-signed root CA that Microsoft includes in the untrusted chain. (Thus, "certtool --verify-chain" will validate the chain.) This certificate is unknown and not in the trust store. Making matters worse, Microsoft has two (2) legitimate root certificates in the CCADB and 33 additional certificates installed in Windows. This "Microsoft Supply Chain RSA Root CA 2022" is not one of them. Why isn't Microsoft's cert chained to a trusted Microsoft root? Everything makes this look like an impersonated certificate chain and not a verifiable and trusted chain. (If this is a legitimate cert, then Microsoft's developers should know better and link to one of their trusted roots!)

Seriously: thousands of examples and they all use untrusted certificates. Again: if you don't trust the certificate, then you can't trust the signature or the data that is signed.

Et tu, DigiCert?

Many of these signed C2PA images also use a DigiCert certificate for providing the signed timestamp. (Let's ignore the elephant in the room and how the signed timestamp can be trivially altered without detection.) The DigiCert timestamp certificate is also not linked to the trust store.

DigiCert provides a web page containing their list of certificates. This list explicitly says "DigiCert discloses all of its public root and intermediate certificates on Common CA Database." However, their "DigiCert Trusted G4 RSA4096 SHA256 TimeStamping CA" certificate is on their web page, but not in any links provided by the CCADB and not found in the trust store that is included on Linux, MacOS, Windows, Android, iOS, Chrome, or Firefox.

I also found other problems with their list of certificates. For example, all of their certificates can be downloaded in PEM (base64 text) and DER (binary) formats. However, the "DigiCert G5 Code Signing, ECC P256 SHA384 2022 CA1" certificate (a different cert) has a PEM link that returns a DER file. (The format doesn't match the link or the file extension.) I reported this problem to DigiCert (Case Number: 03923230).

Their first reply to me indicated that they couldn't help me if I didn't have an account with them. This, despite the fact that I was reporting a problem found on their non-login web page.

Their second response told me how to convert from DER to PEM. Converting it isn't a problem. The problem is that they have the wrong content on their web page.

And that's the rub. I've found three problems with their web page: (1) they don't have all of their root and intermediary certs in the CCADB as they claim, (2) they have wrong content on their web page, and (3) they don't seem to fix reported problems. There is no reason to assume that these are their only problems.

Moreover, these errors on their web site tells me that DigiCert doesn't double-check their content. If a hostile actor were to compromise and alter their web page to provide alternate certificates, I doubt they would notice. This is yet another reason why you should only trust certificates that are published by the CCADB; everything else is untrusted.

C2PA's Bad Workaround

Last December, I explained to C2PA and CAI leadership how to create C2PA forgeries using a self-signed certificate. Their solution was to flag all self-signed certificates on their Content Credentials web site as using an "unknown source".

Shortly after that change, Adobe/CAI made an announcement on their C2PA discord channel:

12/19/2023 5:46 AM

Hi everyone, quick important announcement about the Verify web app at https://contentcredentials.org/verify. There is now a "known certificate" list deployed to prevent self-signed certs etc from imitating known entities like Leica and the M11-p. This works similarly to browser cert lists, such that if a C2PA signing cert chains to something on that list, the Verify app will consider it "known." Early next year, the C2PA will issue clear guidance about how this will work going forward, as known cert lists are an essential part of the C2PA trust model. DM or ping in thread with any questions.

Although there were people (other than me) inquiring about this list, it wasn't disclosed until about a month ago. CAI announced their Verify tool known certificate list on 2024-05-28 on the CAI discord channel. Although they describe this as a temporary solution until they can define their conformance requirements, the entire approach is problematic:

Enabling Forgeries: Their trust list includes root CA certificates. Anyone with a root CA certificate in the C2PA/CAI trust list can generate forgeries for anyone. (I demonstrate this in the section: "Clowning Around".)

Monopoly: Their current (hopefully temporary) submission process seems simplistic, yet restrictive:

If you have an application that is in production and publicly available, you can request to add its signing certificate to the temporary known certificate list: Simply email verify-tl@c2pa.org.

We will review your request, and if it is approved, we'll ask for more details. Once we receive them and deploy the change, you will receive a confirmation email.
Since they are not an industry standard, who are the reviewers and what is the vetting criteria? This seems to be either trivial to social engineer (accept every request if they agree to some unspecified requirements) or enables a monopoly for vetting based on corporate interest (accept almost nobody). On the CAI discord channel, I've seen a handful of people request to have their certs added to this trust list. So far, I haven't seen any of them added.

Privacy and Tracking: The CAI's instructions request developers to download this list often because "the list is subject to change". However, this also places Adobe/CAI in a position to identify each download request. If they watch their logs, then they know each time an application retrieves C2PA/CAI's proprietary trust list. This goes toward privacy, tracking, and potentially identifying the usage.

Self-Hosting: CAI has their own set of trusted certificates that are not linked to the trust store. If every company can have their own self-hosted list of self-selected trusted certificates, then how do developers find them? (The official CCADB trust store is ubiquitous and not vendor specific.) And if a scammer can provide their own "you can trust me" certificate, then it defeats the purpose of using certificates for validation.

Missing Revocation: Let's pretend that someone hostile manages to get their certificate into the CAI trust list. Then they can issue their own impersonation certificates! What is CAI's recourse? Well, they could de-list the untrusted certificate. However, the C2PA Specifications (Section 1.4 in versions 1.1 through the initial 2.0, and Section 1.3.1 in their revised version 2.0) says "C2PA Manifests can be validated indefinitely regardless of whether the cryptographic credentials used to sign its contents are later expired or revoked." As long as the forgery was listed once, then it's good forever. Then again, a scammer can say that they were listed (regardless of whether it is true) and therefore it's always valid.

Incomplete List: Their current "trust me" known certificate list includes root CAs for Microsoft, Leica, Adobe, and Truepic. They also provide non-root "trust me" certificates for nine other companies. This is why Content Credentials can validate certificate chains that are not validated by any other means. Unfortunately, their known certificate list does not include the DigiCert timestamp certificate. Unless you figure out how to find it directly from DigiCert's web page (and trust it despite the other problems on their web page), you can't validate the timestamp certificates.

The alternative to having your untrusted signing cert placed in the Adobe/C2PA/CAI trust list is to pay a trusted CA provider to have a certificate linked to the official trust store. (It costs around $230USD per year.) However, Adobe, Microsoft, Truepic, Leica, etc. don't pay to be linked to the CCADB trust store, so why should you? (It's ironic that Microsoft helps manage the CCADB and doesn't use it for C2PA.)

The end result is that C2PA/CAI does not use the industry standard's CCADB for validation. Instead, they want us to trust them for their own out-of-band trust list. This is like getting an email from a spammer that says, "Questions about this email? Click here to validate it!" It's not a safe practice in spam and it's not a safe practice here.

All Except One

I did make one misstatement earlier in this blog entry. I said that, of the over 2,500 instances of signed images, none of them use verifiable certificates. However, that's not true. There was one (1) picture that used a trusted and valid certificate chain:

The signing certificate is provided by "NIKON CORPORATION". Although the signing certificate expired last year (2023-10-05), it was issued by "GlobalSign GCC R3 PersonalSign 2 CA 2020".

"GlobalSign GCC R3 PersonalSign 2 CA 2020" is in the trust store! It's found on Linux, MacOS, Windows, Chrome, Firefox, etc.! This is a valid certificate chain!

Hintfo identifies two different problems.
1. 'certtool' says the chain is invalid because the signing certificate is expired, not because it is untrusted. (It's fine to ignore the expiration date because C2PA ignores expiration dates.) This is a valid certificate chain because it links to the official CCADB trust store.
2. 'c2patool' says the certificate is invalid because the cryptographic signature does not match. Someone edited the file after signing, causing a cryptographic mismatch.
Although this picture used a valid and verifiable certificate chain, the content is explicitly invalid and untrusted because the signature does not match.

Content Credentials detects the signature mismatch and displays a big red warning message:

This is different from the other warning messages that were just gray boxes and easy to overlook. Content Credentials knows it is invalid due to the bad signature from the good/trusted certificate chain. As a result, it doesn't display any of the metadata.

Unfortunately, the red warning box has a problem. The first sentence is correct: the "information is not valid" because the signature had a mismatch. However, the second sentence is incorrect ("issued by an unrecognized source") since the signing source is known and recognized by a valid certificate chain that links to the CCADB trust store. (Unless... What if Content Credentials doesn't support the CCADB trust store? Then this known chain would be unrecognized to them! That makes this problem much worse because validation would depend on a completely proprietary solution from a single vendor: Adobe/CAI.)

At Content Credentials, you can click on a link to "Search for possible matches". This currently reports one match. The match has a valid signature that is dependent on the invalid signature. Again, wrapping a valid signature around an invalid signature doesn't make it more valid.

This particular image comes from the CAI test suite on GitHub. It's an example of a failed signature match. They have no examples of a valid signature using a trusted certificate chain.

Clowning Around

To demonstrate the problems from using untrusted certificates, I created my own root CA certificate:

# Generate a key. This requires you to set a password.
openssl genrsa -aes256 -out SelfCert.key 4096

# Use the key to generate a root CA that is good for about 10 years.
# This requires the password and certificate information.
# This is the 1st impersonation and hidden by Content Credentials.
openssl req -x509 -new -nodes -key SelfCert.key -sha256 -days 3650 -outform der -out SelfCert-ca.crt
openssl x509 -in SelfCert-ca.crt -outform pem -out SelfCert-ca.pem

I have a few options to make my certificate trusted:

Option A: Get CAI to add it to their list of known certs. (A malicious user can be as honest as necessary to get through the vetting process.)

Option B: Claim it was on their list but that it's no longer listed. It doesn't matter if this claim is fake since nobody (other than CAI) can validate it. This option works because:
1. Per the C2PA specification, a revoked certificate is always valid.
2. C2PA/CAI doesn't use a certificate revocation list (CRL).
3. Even if C2PA introduces a CRL, certificates used before the revocation date are still valid. A forgery can always backdate the signature to predate the revocation.

Option C: Assume that the site checking the certificate doesn't actually validate the certificate chain. This is a good assumption since, until a few weeks ago, nobody could download the CAI known certificate list to validate their untrusted certificates. (CAI's temporary trust list was announced in the CAI discord channel on 2024-05-28. If you're not in that forum and reading every posting, then you probably never heard about the CAI known certificate list; it was never announced on the CAI members mailing list.)

With my new root CA, I can issue a signing certificate for anyone!

# Use the key and root CA to create a signing certificate.
# This requires the certificate owner's information.
# This is the 2nd impersonation and displayed by Content Credentials.
# You can give the signing cert a blank password.
openssl req -new -CA SelfCert-ca.crt -CAkey SelfCert.key -newkey rsa:4096 -sigopt rsa_padding_mode:pss -days 3600 -extensions v3_ca -extensions usr_cert -addext 'keyUsage = digitalSignature' -addext 'extendedKeyUsage = emailProtection' -nodes -x509 -sha256 -keyout 'SelfCert-sign.key' -out 'SelfCert-sign.pem'

# Combine the PEM files into a chain for the C2PA manifest:
cat SelfCert-sign.pem SelfCert-ca.pem > SelfCert-chain.pem

# The C2PA manufest (json) uses the chain and the signer's key.
"sign_cert": "SelfCert-chain.pem",
"private_key": "SelfCert-sign.key",

Here are a few examples that I generated in minutes:

	It's signed by Adobe! As shown by Hintfo, 'c2patool' says the certificate is valid because it didn't verify the chain. 'certtool' says the chain is valid because the chain includes my root CA. The certificate itself is untrusted because it doesn't link to the trust store, but that's no different than anything else signed using C2PA. Content Credentials shows a gray box warning that the certificate was issued by an unknown source, but it doesn't say you shouldn't trust it. It identifies that this picture was issued by Adobe Inc. (Don't worry about the metadata's camera type, GPS location, or dates; I can set those to whatever I want.) If I go with Option A (above) and get my root CA inserted into their trust list, then the gray box warning goes away. And if they won't insert my root CA into their trust list, then I can pivot my attack and claim they are using their monopoly to prevent other companies from participating. (Either way, it's bad for them.)
	It's signed by Microsoft! Again: Hintfo's 'c2patool' says the certificate is valid because it didn't verify the chain. 'certtool' says the chain is valid because the chain includes my root CA. The certificate itself is untrusted because it doesn't link to the trust store, just like everything else signed using C2PA. Content Credentials says that the certificate was issued by an unknown source, but it doesn't say you shouldn't trust it. It identifies that this picture was issued by Microsoft Corporation.
	It's signed by Nikon! Once more: Hintfo's 'c2patool' says the certificate is valid because it didn't verify the chain. 'certtool' says the chain is valid because the chain includes my root CA. The certificate itself is untrusted because it doesn't link to the trust store, but that isn't unusual for C2PA's signatures. Content Credentials says that the certificate was issued by an unknown source, but it doesn't say you shouldn't trust it. It identifies that this picture was issued by Nikon Corporation. Again, don't worry that the camera's metadata identifies a Samsung; I can trivially forge the metadata and then sign it with whatever signature I want.

These clown examples use my self-generated root CA certificate to create arbitrary signing certificates for any other company. This is the same approach used by Truepic. They created their own "Truepic RootCA" certificate and used it to issue signing certificates for the BBC, Hugging Face AI, and OpenAI for DALL-E and ChatGPT. The only differences between my signing certificates and Truepic's:

Truepic is a member of C2PA, while I couldn't agree to the membership's gag order prohibiting public disclosure. (See their membership terms, page 7, item 11, 2nd sentence.)

Currently, the CAI "trust me" list of known certificates only includes C2PA members. Truepic's self-generated RootCA certificate is in the CAI "trust me" list. My demonstration certificate isn't (currently) in their list. As a result, Content Credentials displays a gray "unknown source" warning box. (We'll get around that in the next section.)

Truepic appears to be distributing their signing certs as a product. My certificates are used to demonstrate the vulnerability.

We assume that Truepic has permission from these companies to generate signing certs for them. However, permission is not required for generating a signing certificate, as demonstrated by my examples and the Tiananmen Square forgery's fake Adobe Stock Photos signature.

In each case (Truepic, clowns, and Tiananmen Square), the signing certificate is not linked to the trust store, does not comply with industry standards for public use, and even with CAI's "known certificates" list, users cannot automatically tell if the signer is authorized. (You just have to trust us.)

These examples demonstrate how easily someone can forge the credentials for any C2PA-signed media. I can change the camera information, the attribution, the GPS coordinates, the timestamps, and the signatures.

(I want to thank Bing Image Creator for rendering these wonderful pictures from prompts like "Circus performers dancing near a trash receptacle that is ablaze. In the background is the ocean. In the foreground is a tall wooden pole. At the top of the pole is a public announcement speaker." I call these pictures: "Clowns and their dumpster fire from sea to PA.")

See One, Do One, Teach One!

So how was the fake Tiananmen Square picture created? I'll show you! I took my forged Nikon clown picture and uploaded it to Microsoft Designer. I didn't make any changes. I uploaded the picture with my fake metadata and impersonating C2PA signatures, and then downloaded it as a JPEG:

Microsoft will sign anything. Now we have a picture that is signed twice. The inner signature is my Nikon forgery and the outer signature is Microsoft using their untrusted "trust me" certificate chain.

Adobe's "c2patool --certs" only extracts the outer certificate chain. (Hintfo uses this to extract the C2PA information.) Unless you build your own tool, you cannot validate the inner chain.

If you build your own tool and extract the inner chain, then GnuTLS's "certtool --verify-chain" says the inner chain is valid because my forgery included both the signing cert and the self-signed root CA. The 'certtool' program won't consult the local trust store if the self-signed root CA is present in the chain.

GnuTLS's "certtool --verify-chain" on the outer chain says it is valid because Microsoft included both the signing cert and their own self-signed root CA. (Hintfo uses this to evaluate the certificate chains.)

For both the inner and outer certificate chains, GnuTLS's "certtool --verify" says certificates are untrusted, but that's the norm for anything with a C2PA signature.

The Content Credentials web site says it was signed by Microsoft (because it was). The file has one dependency that came from Nikon. There are no errors, no warnings, and no mention that the previously-untrusted chain is untrusted. Why this works: C2PA appears to accept nested certificates as long as the chain is complete. (Completely based on trust!) They never check to see if the inner chain is trusted.

I think it's time that we stopped pretending that C2PA only has minor issues. C2PA is fundamentally broken at multiple levels. Products incorporating C2PA provide no reliable provenance or authentication; it's snake oil. At minimum, C2PA needs an open and public review along with a ground-up redesign. Companies that still claim that C2PA works are lying to their customers and enabling fraud-by-authentication.

But It Gets Worse: Intentional Misrepresentation

Private root certificates (not linked to the industry-standard trust store) do serve a purpose. They are useful when the certificates should only be used by a restrictive group of people, such as your employees or specific customers. For example, different branches of the US Government have their own root certs that are only for use with their specific divisions (NASA, Department of Treasury, Department of Homeland Security, etc.). The City of York (UK) provides their own root certs for their employees, and companies like IBM use private root certs for specific projects. These are just a few examples; organization-specific private certs are common.

The problem with C2PA/CAI's use of private certificates for public signatures is that we, the people trying to verify these signatures, are not the employees or customers of C2PA, CAI, or their affiliates. Prior to Adobe/CAI releasing their "temporary known certificate list" (about a month ago), developers could see the C2PA signatures from Adobe, Truepic, Microsoft, BBC, etc., but had no means to independently validate the certificates. And now, forcing C2PA users to install the private certs managed by Adobe/CAI makes C2PA a vendor-specific solution; all validation capabilities are effectively controlled by a small group within Adobe/C2PA/CAI. The C2PA architecture and CAI implementation (both led by Adobe) has the wrong use model for widespread adoption. Because Adobe/C2PA/CAI uses the wrong use model, their solution enables impersonations and deters independent verification.

In this blog series about authentication, I've detailed many different problems with C2PA. However, not all of the problems have the same underlying cause and some could be described as oversights or bugs. To put this certificate problem into perspective:

The unverified timestamp problem could be attributed to a lack of testing and a misunderstanding among Adobe developers regarding how timestamp certificates works.

The "completely based on trust" problem could be due to a lack of technical review along with little or no thought about how C2PA could be misused.

The "BBC authenticated a forgery" problem could be due to a lack of training for their vetting staff and an overzealous effort to promote C2PA.

The unsynchronized timestamp issue could be resolved if they just kept their clocks synchronized.

But the use of untrusted certificates? That is absolutely a conscious effort. You do not "accidentally" generate a root CA certificate, issue your own signing certs, and build an infrastructure for using and distributing them. Moreover, C2PA members, including Microsoft, Google, and Amazon, are very aware of the CCADB (because they have their own roots published in it) yet don't use it for C2PA signatures. For some unknown reason, C2PA/CAI does not want to use publicly verifiable certificates. They want to enable unverifiable signatures over unverified content, and they want to claim that it can be trusted.

In their media blitz, the C2PA promoters have repeatedly stated a desire to protect elections and combate deepfakes. However, they really seem to be doing the opposite. They appear to want to help people alter election outcomes, sway opinions through the use of authenticated forgeries, and enable the use of propaganda and other influence campaigns through false provenance and unverified authenticity information. C2PA isn't helping solve the problem; C2PA is enabling the problem. C2PA is the problem.

Read more about Authentication, Forensics, FotoForensics, Image Analysis | Comments (8) | Direct Link

Perceptual Searches and Applied Heuristics

Saturday, 22 June 2024

I've been seeing a steady increase in new search-by-image (perceptual search) systems. Much of this work is related to the image provenance problem: find the original version. I've seen a variety of new search technologies, including four new approaches this year. Some identify variations of the same picture, while others can identify instances of the same objects.

A Perceptive Eye

Different perceptual search engines serve different purposes. I often find myself using a combination of Google Image, Bing Image Search, and TinEye to track down variations of pictures and related information. I also have my own private perceptual search engine at FotoForensics (not publicly accessible due to privacy concerns) that I use for finding variations of the same picture based on the uploaded content.

When I see new image search technologies, I immediately try to identify their strengths and capabilities:

Has it indexed pictures from across the internet, or from a small niche market? Google and Bing have indexed large swaths of the internet. TinEye has a smaller collection that covers stock photos and older viral images. The standalone commercial FotoForensics system includes a RootAbout search engine that indexes the client's collection (and not the entire internet).

What kind of resources does it need? My RootAbout engine can compare over 3 billion images per second per CPU, with no special hardware and only a few megabytes of disk space for indexing millions of pictures. (RootAbout is fast and lightweight; it takes about 12 opcodes per comparison, but with pipelining, it runs at the speed of the CPU.) In contrast, Google's Image system is hardware intensive and requires lots of disk space. (If you don't have dedicated hardware, then it probably won't work well.) Some of the other solutions I've seen vary from dedicated distributed computers to multiple CPUs and large local archives.

What does it identify? RootAbout is limited; it only finds variations of the visually same picture. If you take a picture of a chair and scale, crop, or do minor edits, then RootAbout will still match it. However, if you take two photos of a chair from different angles, then they are two visually different pictures and RootAbout won't match them. Other engines, like Google and Bing, do object identification and would match the chair at different angles. But then again, the same chair will likely match other visually similar chairs; I haven't seen any object tracking systems perform unique object identification.

What does it return? Some engines list the files that matched. Others identify the types of objects, any printed text in the image (OCR), decoded barcodes, etc.

How sensitive is it to rotation, cropping, recoloring, and edits? TinEye and Google often miss mirrored images, which is why many scammers flip pictures.

It's not that one search engine is better than any other. Rather, each provides a different range of capabilities. You should choose the correct tool based on the problem you're trying to solve.

What is Authoritative?

When examining new search technologies, I also try to determine the developer's use case and intent. If they intend to create another way to find similar pictures, that's great! But if they think it will be a magical solution for establishing authentication or provenance, then that's a problem.

For example, in one of the chat forums that I frequent, a small company announced their new image search engine. They intend to use it for determining image provenance. One of the people in the forum (not me!) asked, "How do you determine who's the original creator? (by the earliest published timestamp?)" The reply? "Yes, earliest published timestamp from the social posts." Ouch.

Sadly, depending on "the oldest one" or "the first sighting" is very common and a really weak heuristic. (A heuristic or rule of thumb is a shortcut or rough approximation based on partial information.) In fact, there are a bunch of common heuristics that developers often rely on for resolving "which one is most authentic" quandaries. These include:

Oldest is original. The basic belief is that the earliest sighting hasn't had time to go viral or undergo lots of alterations. However, often the original is posted after the controversy begins or is never provided. Additionally, the search engine may not have indexed all content on the internet. "The oldest that we indexed" may not be the oldest instance out there. Relying on something like the "earliest published timestamp from the social posts" assumes that you can index every social media service and cross-posting, have access to private forums, the "original" media was posted to the social network, and there was no delay between image creation and forum disclosure. None of these are good assumptions.

Biggest dimensions. When pictures go viral, they are often resized smaller. While the biggest dimensions usually suggest the highest quality picture, it's not always the case. Scaling a picture larger or outpainting algorithms (where AI extends picture's edges) are examples of larger dimensions not being more authoritative. For example, the GitHub project stablediffusion-infinity has a great example where outpainting was used to make a famous painting ("Girl with a Pearl Earring" by Johannes Vermeer, c. 1665) larger:

This heuristic can cause big problems when used for automatically determining authenticity. For example, Truepic Inspect (a defunct service) used "image quality" (a combination of dimensions and last-saved quality) to automatically assign a trustworthiness value. When I evaluated their service, I found that simply scaling a picture larger made them report that it was more trustworthy.

Biggest file size. In general, higher fidelity requires more data, so a bigger file size may be assumed to reflect better quality and fewer resaves from alterations or viral distribution. However:
- PNG files are almost always larger than an equivalent JPEG, but no cameras natively capture photos as PNG. This means that the larger PNG file is never an original photo.
- If I have a JPEG that was last saved at 80% quality, and then re-save it at 100% quality, then the file size will be larger even thought the fidelity won't improve.
- Some mobile devices, like Apple and Samsung, can include additional images after an in-camera alteration. These depth-maps and preview images almost always makes the file larger, even though the image is not more authentic.
While the file size is a good initial heuristic, it's far from authoritative.

Most metadata. In general, camera original images have a lot of metadata, while viral distribution (Facebook, Twitter/X, Telegram, etc.) typically strip out the original metadata. So if you see a picture with more metadata, then it may not be a viral distribution. However, many applications record changes in metadata, so more alterations can also result in more metadata. Just consider something like XMP; some cameras add a little XMP, but lots of XMP usually indicates an alteration. Good forgeries often copy over camera-original metadata. (The AvtoVAZ forgery method was great at replicating metadata.) Relying on the most metadata could lead you to the most-altered version.

C2PA provenance. In theory, C2PA's metadata should record how a picture was created and where it came from. In practice, it is unreliable at best and completely misleading at worst. On top of this, Adobe and Microsoft will digitally sign anything; they don't validate the metadata before signing. Solutions that rely on any kind of C2PA metadata today are fundamentally flawed. What I tell my customers: if you see C2PA metadata, then you should immediately assume that it was altered by an application and not trust anything without external confirmation.

C2PA timestamp. This also goes toward the belief that the "oldest is most authentic". However, timestamps from local signing are often incorrect, easy to fake, and permit altering without detection.

Heuristics are not intended to be authoritative. (If they were authoritative, then they wouldn't be heuristics.) Some developers think they can combine these heuristics to make a more reliable system. For example, "biggest dimensions and most metadata and C2PA information." The problem is, these attributes don't combine like probabilities; you can't add them up or multiply their metrics. Rather, heuristics combine like a min/max functions or the weakest link in a chain: they are as reliable as the most reliable heuristic. And as a heuristic, none of them are authoritative or very reliable.

Proof By Example

As an example of how these different heuristics work, consider this picture (click to view it at FotoForensics):

On 2024-06-13, FotoForensics received this picture of orange flowers with a white border. I don't know who uploaded it or the background story behind it. The file is 1024x768 and 968,511 bytes.

Using my own perceptual search engine, I found five other versions of this picture at FotoForensics. None of them have the white border. For easy referencing, I assigned a title to each of them based on either the uploaded filename or where it came from. While each picture is visually similar, they are all different variants:

	Title: "Orange". First received at FotoForensics on 2023-01-22. At 99,644 bytes and 698x527, this is the smallest version I've seen.
	Title: "Facebook". First received on 2022-03-21. The image is 1024x768 and 163,950 bytes. The upload came from Facebook. However, Facebook only distributes content that came from elsewhere. This only identifies that it originated somewhere else before it was uploaded to Facebook.
	Title: "WelcomeScan". This is also 1024x768 but is 228,040 bytes. The metadata attributes the picture to "Natphotos/Digital Vision/Getty Images", indicating that it is based on a stock photo. The IPTC metadata doesn't identify the specific stock photo ID, but says it was dated 2004-04-09. (I'm taking all metadata at face value and not assuming malicious alterations.)
	Title: "Fleurs". This is also 1024x768 and attributed to the same 2004-04-09 stock photo, but the file is 516,424 bytes.
	Title: "Bienvenue". While this is the same file size (516,424 bytes) and same dimensions (1024x768) as the Fleurs picture, the metadata is a little different. This file was first uploaded to FotoForensics on 2013-04-06, making it the oldest direct sighting. It is the only version that has been repeatedly uploaded to FotoForensics (92 times!), making it the most popular variant. However, it's still not "original". Besides being a stock image, it contains no camera information in the metadata and was last saved using Adobe Save-for-Web.

Google and Bing image searches also turn up lots of versions of this picture. For example, Google Search found an older version at Flickr:

Flickr received this on 2016-05-12 (nearly 3 years after FotoForensics first saw it). Flickr's "Original" size is 1024x768, but only 228,076 bytes, making it a smaller file that my earlier version. And while the metadata does identify it as being a stock image from Getty, the only date in the metadata denotes when it was re-encoded by Adobe Photoshop Elements for Windows on 2006-02-06 (nearly two years after the other stock image dates seen in the other variants).

TinEye helped me find the picture at IMDB. It's the cover for a short movie called Spring Day by Rouzbeh Rashidi (circa 2000). The odd thing is, it's supposed to be a six-minute black-and-white movie, but this is a very colorful image. I suspect that the cover art was added after the movie came out. (The cover art image is dynamically scaled by a service at Amazon, so there is no informative metadata. We don't know when it was added to IMDB.) This instance of the orange flowers could be a red herring.

While TinEye does identify some stock photo services with this image, none of the stock services seem to have it online. For example, ShutterStock reports, "We're sorry! 1829668118 is no longer available. You might be interested in the related results below." The rest of the links are either unavailable or clearly not authoritative. (There's a common scam where people find pictures online and uploaded them as the author to various stock services, in hopes of making a quick buck from licensing. They rely on the fact that these stock services don't authenticate the uploader's ownership claim.)

Applied Heuristics

Now we can apply the previously defined heuristics to see how they hold up for this one sample image:

Oldest is original. The heuristic's interpretation varies based on how we define "old". For example:
- The oldest version at FotoForensics is from 2013; it's also the most popular variant.
- The oldest confirmed timestamp online is from Flickr, which says it was received in 2006.
- Going by internal metadata, I have three different versions that identify the picture as being from 2004-04-09. But since they are all different variants with the same information, so I don't know which is the official version.
- There's an unconfirmed sighting at IMDB, linking the image to a movie that came out in 2000. While I don't think this is accurate, it is the oldest attribution.
In each of these case, none are the "original" picture. They are all re-encoded versions that lack original metadata. This heuristic doesn't identify the original image.

Biggest dimensions: Many of these variants have the same big dimensions: 1024x768. However, I can use Amazon's dynamic image generator to render the picture larger. (How about 2000x1500!) Although the source "white border" picture has the same dimensions as Facebook, WelcomeScan, Fleurs, and Bienvenue, it has a big white border. If we exclude the border, then the visually similar content is smaller in the white border version than these other variants. So this heuristic suggests that the white border version isn't the most authoritative. Unfortunately, since we have multiple picture with the same "biggest dimensions" (1024x768), we don't know which (if any) is the original.

Biggest file size. The white border version is the only PNG and it's almost twice the size of the largest JPEG. The largest JPEG is Bienvenue, which is also the most popular variant. However, it was encoded as a JPEG at 100% quality. The other JPEG images are smaller, but were also saved at lower qualities, which results in smaller file sizes. (Fleurs is at 99%, WelcomeScan at 94%, and Facebook at 81%.) Saving the image at a higher JPEG quality doesn't improve the image; it only mitigates further degradation. Based on the JPEG entropy (amount of noise) in the picture, Bienvenue has a cumulative degradation equivalent to a JPEG saved one time at 55% quality. It may be a big picture, but it's really low quality. (Nobody uses JPEG entropy as a heuristic because edits can increase the value. Based on the entropy, WelcomeScan is the highest quality at 58%, and that's still very low.)

Most metadata. Fleurs has the most metadata because it was imported into Windows Photo Gallery and then manually labeled. Excluding the color profile information, the Flickr version has the most metadata because Photoshop Elements added in post-processing information. And if we evaluated based on metadata size, then the white border version has the most metadata because it includes C2PA information and that has a huge overhead.

C2PA provenance and C2PA timestamp. Only the white border picture has C2PA metadata. According the C2PA metadata, this picture came from Microsoft Designer on 2024-06-13. Keep in mind, Microsoft Designer came out around October 2022, so this is definitely not the original source for the picture. This file contains no other information about the image's provenance. Since we found variants of this picture dating back over a decade, we can be certain that the C2PA provenance information is incomplete and misleading.

(Again, I'm taking all of the metadata at face value. I'm ignoring the fact that the C2PA signed timestamp is trivial to alter and the signer's clock is likely wrong.)

There's only one sighting with C2PA metadata. But let's pretend that all of these picture have accurate C2PA provenance information that is consistent with the other metadata that exist in each file. We still don't have any consistency and don't know which one is the 'authentic' version. C2PA does not help here. As heuristics go, relying on C2PA is the worst option.

Heuristics are not authoritative. Analysts use heuristics to sift through large amounts of data, but it requires an understanding of the material, reviewing the content, and perseverance to get near the ground truth.

How to 'Unsuckify' Heuristics

When it comes to analysis, you need to use the clues you have. While none of these pictures identify the original source, these heuristics do provide great clues. For example:

Minimum age. When was it first seen? This sets a "not more recent than" time period. This may not be the absolute first appearance. In this case, FotoForensics says that this picture has been around since at least 2013. TinEye found their first sighting on 2008-02-04, and Flicker shows that it has existed since at least 2006. If we trust the metadata, then that further backdates the age to "at least 2004", making it 20 years old!

Popularity. Based on how many sightings there are, we can approximate whether it is widely known or viral. In this case, there are relatively few sightings. (A viral distribution would be like pictures from One Direction fans or followers of Kate, where the variants came in massive floods from a wide range of sources.) With a lot of digging, you might get lucky and find copies of the picture before it went viral.

Stock Imagery. Does it appear on any stock photo services? This doesn't prevent stock photo services from indexing viral or public images, or from providing false attribution. But it's a good starting point. Also, many stock sites identify when it was first available, further backdating the minimum age.

Alterations. Are there any differences or edits between versions? It's really subtle, but someone used a clone tool to remove a rectangular region from the top right corner of the WelcomeScan picture. If I had to guess, they removed an attribution. In the other variants, the colors have been altered. So at minimum, I know that the undiscovered original had slightly different coloring. And along one of the viral paths, someone added in an attribution and then it was later removed.

Context. Similar versions of the image may be on web pages that provide background information. In this case, many of the filenames are in foreign languages: Fleurs (French), Bienvenue (Spanish), and from the popular Bienvenue uploads, I've also seen names like "Digitalização de Boas-vindas" (Portugese), "Begrüßungsscan" (German), "Skenirana datoteka dobrodošlice" (Croatian), and more. Many of these names translate as "Welcome Scan". A few of the other common names for this file include "Test Picture" and "Garden" (both in a variety of languages). This tells me that the picture comes from something international.

Using the 'context' heuristic, I was able to find the ground truth. A quick Google search for the text "Welcome Scan" immediately identified a page at the Internet Archive: Welcome Scan and it also shows the picture!

The expert archivists at the Internet Archive recorded this information in 2019:

Publication date: 2004-04-09
Topics: windows, vista, fax, and, scan, fax and scan, garden, sample, pictures, sample pictures

Taken from Windows Vista's Fax and Scan program. On Windows 7 and beyond the image is replaced with the sample version, which is 1024x768.

The sample image is called Garden. The "high quality" version is called Welcome Scan.

The highest quality "Welcome Scan.jpg" version at the Internet Archive is 2154x1437 and 709,832 bytes (biggest dimensions and JPEG size found so far).

The picture has almost no informative metadata besides the abstract name "454054.TIF" and being saved by Adobe Photoshop 6.0 on a Mac. (Adobe Photoshop 6.0 was released in September 2000, so that works with a 2004 attribution date.)

It was last saved using Adobe Save-As quality 8 (equivalent to the JPEG standard at 80%).

The colors were still altered (made brighter), but not as much as the later viral version seen in Bienvenue and Facebook.

This highest quality version at the Internet Archive does not mention Getty or other stock photo services. However, that attribution is mentioned in their smaller "Garden.jpg" 1024x768 version.

Windows is international software and their "fax/scan" software explains the file name "Welcome Scan".

Additional Information

Evaluating pictures and tracking origins is often done on a case-by-case basis. Perceptual searches are one type of tool in the analyst's collection. Based on the results, we might be able to identify the art or artist (especially if the work is famous), possibly copyright information, or similar objects or products, such as clothing or furniture. (Bing Image Search is pretty good at this.) If the content is linked to a country or region, then that can further help identify the source.

However, perceptual searches are far from perfect. The accuracy depends on the search engine, amount of indexed media, and quality of the image. Very unique or unpopular images may have no search results. Additionally, there may be privacy concerns related to submitting an image to someone else's service.

While I am absolutely thrilled to see a larger variety of perceptual search systems becoming available, I'm concerned about the focus on making them do something that they just can't do. These search engines can find similar pictures from a known collection. They cannot tell you which one is more authentic.

Read more about Authentication, Forensics, FotoForensics, Image Analysis | Comments (3) | Direct Link

On The Button

Sunday, 16 June 2024

Maybe I'm getting older and losing my hearing, or maybe The Boss is getting quieter. Either way, when I'm in my office, I can't hear her calling "Neal!" So... I got her a bell. She can pick up the bell and ring it to get my attention. The problem is, either she's ringing it really quietly or I'm really deaf, because I can't hear the bell when I'm in the office. (Since The Boss complains that the bell is really loud, I think it's more "someone quietly ringing the bell" than my hearing.)

I know a few (elderly) people who have intercom systems in their homes. Seriously, honest-to-god hard-wired intercoms! While these are certainly effective, this is technology from the mid-1970s. I know I can do better.

I also know plenty of people who regularly text or call each other from inside the house. But this assumes your phone is nearby and charged, which is not always the case for me. Also, I need to adjust my glasses to read any text messages.

An IoT Device

This week, I decided to make a better way to address this pager problem. I wanted something as simple as an intercom system but without being hardwired to the home. I ended up making my own network-enabled button. I call it the "NetButton". (I know, a really clever name!)

The bottle is about 1.5 inches (3.8cm) in diameter and 2 inches (5cm) tall. That's just the right size for holding in your hand.

The Hardware

Although I think I'm pretty good at programming, I'm not much for electronics and my soldering skills seriously suck. But I was able to put this thing together without much problem.

When you reach a certain age, you end up taking lots of prescribed medications. I have a box full of old pill bottles because you never know when you need a small container for holding nails, screws, candy, or other things. For this project, I found the form factor to be just right and the bottles are easy to cut with a pocket knife.

I cut a small hole in the plastic lid and mounted a simple push button.

For the microcomputer, I used an M5Stamp-S3 that I had sitting around from a previous project. For something so small and inexpensive, they are extremely powerful. Dual-core CPU, plenty of RAM, built-in button, built-in super bright multi-color LED, built-in wifi, and lots of pins for I/O. If you know how to program an Arduino or ESP32, then this is the exact same thing. (I used the Arduino IDE for programming it.)

Besides the microcomputer and button, I included a small piezoelectric speaker and a tiny breadboard for putting it all together. (The breadboard permits me to easily reuse the parts for other projects in the future.)

The button goes from ground to one of the I/O pins. The speaker goes from ground to a different I/O pin. That's it. I didn't need to wire in an LED since the M5Stamp-S3 already contains a light.

The M5Stamp-S3 is powered by a USB cable. One end of the USB cable goes into a USB power brick and is plugged into a wall outlet. For the other end, I cut a small hole in the side of the bottle for the cable to plug into the microcomputer. The hole is a tight fit, so the USB cable holds everything in place. No glue, no tape, and nothing moves. You can shake the bottle and not break any of the electronics.

It was so easy to make, that I built three of them with red, green, and yellow buttons.

The Software

The programming took me a few evenings, but I gave it a lot of features:

When it first boots, it starts up a wifi access point. You connect to the AP and configure the device. You select the local wifi network, enter the wifi password, and specify how you want it to report. You can also name the device. (I called them "NetButton Red", "NetButton Green", and "NetButton Yellow", again, because I couldn't think of better names.) The configuration is saved in static memory, so you only need to do this once.

After it is configured, it drops the AP and connects to the local wifi. Then it just listens for events, either button presses or connections from the local network.

If someone pushes the button, then the local device shows a light for a few seconds. At the same time, it connects to an HTTP URL (from the configuration page) to report that a button press happened and it transmits a broadcast UDP packet announcing that the button was pressed. The other netbuttons listen for the UDP packet.

When any NetButton hears the broadcast UDP packet, it flashes a light and sounds an alarm for 5 seconds: BEEP silent BEEP silent BEEP silent. This tells you that someone pressed a button. (I added a check so the sender won't trigger itself.)

If someone holds down the button for a few seconds, then the light changes from green to red, and it sends out a different broadcast alert. Any listening NetButton will change the flashing light pattern and switch the sound to a siren: BEEP beep BEEP beep BEEP beep.

If you hold down the button for 30 seconds, it does a factory reset. This restarts the AP in case you need to reconfigure it.

The basic idea:

If you need someone, then press the button once. If it's an emergency, then hold the button down for a few seconds.

The button being pressed doesn't make a sound; it just lights up. Instead, pressing the button makes all of the other NetButtons sound alarms.

By using a broadcast UDP packet, I can trivially add more NetButtons as they are needed. Right now, there's one in the bedroom, one in the living room, and one in my office. If needed, I can add more for other rooms. The only requirement is for all of the NetButtons to be on the same wifi network.

The NetButtons also send messages to a configured URL. It can easily interface with services like "if this then that". In my case, I have my own logging system.

But I didn't stop there. The NetButtons also send a periodic heartbeat message. This way, I can determine if one of them is offline. (That's how I figured out that I accidentally plugged one of them into a light switch outlet. When the lights were off, the device was off and the heartbeat wasn't received.)

I also allow other commands to be sent from a computer, either to specific NetButtons or to all of them. The commands? Things like "light" (the light turns on for a few seconds; great for debugging), "chirp" (a single beep), "silence" (don't sound the audible alarm; just blink the lights in case I'm in a meeting), "alarm", "siren", and more. I'll probably put in a lot more functionality as I realize what else I need.

Other Uses?

Right now, these NetButtons are mainly designed to make it easy to call me out of my office. However, it can work for lots of situations:

"Time for dinner!"

"FedEx just left 20 boxes by the door, can someone help me move them inside?"

"We have a door-to-door solicitor. Can you make them go away?"

"I'm sick with a cold/flu/covid/whatever, can you get me a glass of water?"

"I wrote another blog. Can you proof-read it?"

Regardless of the need, this will be a great way to call for assistance.

Read more about IoT, Network, Programming | Comments (5) | Direct Link

Problems with C2PA and LinkedIn

Saturday, 8 June 2024

When writing about C2PA problems, I often receive feedback from people who are shocked and disturbed at how weak and ineffective the C2PA solution is. At the same time, these same people often feel peer pressure to support C2PA or are under the belief that these issues can be "patched". On this "wait for the patch" mentality, I want to be explicit: No, there are no simple fixes or easy upgrades to mitigate these critical problems within the C2PA architecture.

Implementation Defects

There are different types of software defects. The first kind are programmatic mistakes or bugs. Maybe the programmer forgot to initialize a variable, had a typo in the code, or didn't check if a dependency existed before calling it. These often cause crashes, errors, or other types of failures. While some bugs may be hard to identify, a simple change to the code or process usually resolves the issue. The "patches" often replace one part of the code, but not the entire thing.

Programs that interconnect with other programs need some kind of interfacing handle. These are often provided by an Application Programming Interface (API). The API is any kind of well-defined method for interacting with the code. In my previous Jitter Bug problem, the bug was in the operating system's kernel. The patch required replacing the kernel. However, this didn't change the kernel's API and I didn't need to replace any other piece of code.

For the most part, my blog entries about C2PA's problems are independent of the implementation. I have not focused on bugs in specific pieces of code. But this doesn't mean that their code is bug-free. (If I started pointing out their implementation problems, then it would be like shooting fish in a barrel.)

An example of a C2PA implementation defect (bug) was recently reported in their Discord channel. A developer noted that two different implementations of the same C2PA specification behaved differently (my bold for emphasis):

I think the `ec_validator` should be updated, but wanted to run the scenario by you. We incorrectly signed our data with ECDSA w/ SHA-384, leaving the signature in DER/ASN.1 format. This works fine with the `ec_validator` in `c2patool` usage, as it is smart enough to check if it is P1363 format or in DER format.

The `SubtleCrypto::verify` call will fail to validate, because it is expecting the signature to be in `P1363` format. In other words, the `c2patool` says valid but the `c2pa-js` modules say invalid. I think it would be better and safer if the non-wasm code just errored if not in P1363 format, does this sound reasonable? If it does, I can write up an issue on the repo and possibly put up a PR with such.

This inconsistency could be due to a missing definition in the specification that caused two different developers to make conflicting decisions, or maybe it's an oversight by one (or both) developers. Either way, as soon as they identify the correct way to handle this condition, the developers should be able to easily patch the code and provide an update that generates consistent results for this specific case.

While patches can often address implementation problems, not all patches are great solutions. Some of these fixes are "ugly hacks" that work around the immediate problem but don't address the core issue. For example, last November I itemized C2PA's problems. This led to a conference call with representatives from C2PA and CAI (who all work at Adobe). They rushed out a patch, but it wasn't a great solution. Their fix? C2PA/CAI/Adobe's Content Credentials web site no longer accepts self-signed certificates. Now they flag all media with self-signed certificates as untrustworthy. However:

If a forger wants to pay $230(USD) to a certificate authority (CA), then they can have a "trusted" certificate for generating their forgeries. (Similar to charging Twitter/X users for a blue "verified" checkmark.) Unfortunately, paying doesn't authenticate the content; paying only authenticates that some unidentified person paid. This ugly fix limits people from using free self-signed certificates, but doesn't prevent paid forgeries.

The fix appears to only be deployed on the Content Credentials web site. The command-line 'c2patool' (latest version 0.9.2) still reports self-signed certificates as valid and with no indication that it is self-signed.

In other words, their "quick fix" only addresses one corner-case usage, and it introduces more problems than it solves. (I've heard this type of fix called a "knee-jerk patch", because it is a quick reaction that isn't fully thought through.)

Implementation Oversights

Another type of common bug happens when developers fail to implement the full specification, or misinterpret the specifications. Given the size and scope of the C2PA specification, we can expect developers to take shortcuts, ignore error conditions, fail to report important information, or incorrectly report details.

As an example, this picture was generated using OpenAI's DALL-E and contains C2PA metadata:

You can see the full technical details at Hintfo. However, most people just want a short summary.

If you view the short summary at Content Credentials, then you see misinformation in the bottom-right corner, where it provides details about the signature:

Content Credentials lists who signed the file (in this example, "OpenAI") under the label "Issued by". If you dive into the C2PA manifest, you'll see that the "Issuer" field was set by the user as a means of self-identification. This is not part of any cryptographic signature and not authenticated. If some other user wants to generate a C2PA record, then they can also specify "OpenAI" or any other name. This information is not verified. As a consumer, you must trust that this unauthenticated field isn't part of an impersonation.

Content Credentials list the "Issued on" date (which should be "Signed on" date), but they don't identify that the date comes from DigiCert and not OpenAI. In fact, the certificate from DigiCert isn't mentioned anywhere! As a user, you have no idea who provided this time stamp or whether your should trust them!

These implementation oversights give uninformed users a false sense of trust in a solution that claims to provide "provenance and authentication."

Modular vs Monolithic

There's a truism in software development: every minute you spend on the design cuts development time by an hour. A good design with a good programmer will result in a good solution. A good design with a bad programmer may have implementation defects, but they can still be fixed with patches that don't impact the design, API, or anything else.

However, a bad design will have problems regardless of the programmer. Moreover, patching the design usually requires major changes to any APIs.

Clever software architects use a modular design. This allows them to replace one section of the system without needing to rewrite everything. In my Jitter Bug example, I don't know how large the kernel patch was. For all I know, it could have tendrils that reach into large swaths of the kernel's code. However, because of the modular design, it didn't impact any kernel plugins, drivers, or programs that depend on the kernel.

As another example, my FotoForensics and Hintfo services explicitly use modular designs. I can replace, patch, or upgrade the web server, PHP engine, or any single analyzer without impacting anything else. Moreover, adding in new support for something like WebP or HEIC only required adding a new plugin for that file format. Because of my API definitions, I can enable new functionality without rewriting anything else.

C2PA uses a weakly-modular design. Yes, C2PA supports upgrading OpenSSL, but it can't replace OpenSSL with GnuTLS, NSS, wolfSSL, Schannel, or other SSL library without a major rebuild. While all of these libraries provide SSL/TLS functionality, they all use different APIs. And unfortunately, C2PA's libraries are hard-coded to use OpenSSL. In comparison, tools like wget and curl include an abstraction layer that translates from the program's needs to the SSL library's API. As a result, you can recompile curl to use a different SSL library without rewriting the code. And regardless of the SSL library, wget and curl will perform the same way.

With C2PA, the architecture is monolithic (not modular). This means that any new functionality requires a large code change and any flaw in the design impacts a large area. Addressing changes in c2patool seems to often be tied to code changes in c2pa-rs. There often isn't a quick fix to any of their bugs.

Design Defects

Designing complex software isn't easy. It requires conceptualizing all of the possible inputs, outputs, and interactions. Design defects often represent misunderstandings about the needs, technical limitations, and the trade-offs with any choices.

The architecture specifications in software are a lot like the script of a movie. A great movie script usually results in a great movie, regardless of how bad the actors are. (In my opinion, The Matrix is an incredible movie despite Keanu Reeves' cardboard performance.) In contrast, a bad script will result in a bad movie, regardless of how many exceptional actors appear in the film, like Cloud Atlas or Movie 43.

Many of my criticisms about C2PA are related to poor design decisions. It doesn't matter who implements it since they are implementing a flawed architecture. For example:

The C2PA specifications use X.509 certificates but explicitly ignores revocation. Quoting the specifications:

C2PA Manifests can be validated indefinitely regardless of whether the cryptographic credentials used to sign its contents are later expired or revoked.
It doesn't matter who implements the specs; they will all have problems if an attacker gets their hands on a valid certificate.

C2PA's design is completely based on "trust". No matter how closely the implementation follows the spec, the resulting code still assumes that the person inserting the metadata is trustworthy. It does nothing to deter dishonest people.

C2PA's specification says you can use a third-party time stamp authority for notarizing the data. However, the signed time stamp response does not prevent altering the time stamp! Again, it doesn't matter who implements this since the design uses the wrong data structure.

While implementation errors (bugs) can be patched, design flaws typically require a rewrite, and a monolithic solution with a design error often requires a major renovation. This is the definition of a fundamental design flaw: fixing the flaw basically results in a completely different product. Think of this like going from Windows 3.1 to Windows 11: Yes, they are both different versions of the Windows operating system, but there is a complete rewrite involved and almost zero compatibility between these versions. (You don't just need to change the operating system, you need to change every driver and program that interacts with the operating system.)

Fishy at LinkedIn

With great fanfare, LinkedIn's VP of Legal and Digital Safety Patrick Corrigan and C2PA announced that LinkedIn was deploying C2PA's Content Credentials:

LinkedIn has taken a significant step toward enhancing digital trust and transparency by adopting C2PA Content Credentials. With Content Credentials, LinkedIn users can easily identify AI-generated images and videos in their feeds.

Digital content with a "CR" icon will offer a transparent path back to its origins, revealing provenance information and any modifications made using AI.

This is a blatant lie. While it does display a "CR" icon on some digital content, it does not provide information transparency or anything trustworthy about the media's origins, provenance, or modifications made using AI.

Along with this announcement, LinkedIn began rolling out the support. If you upload a picture (or video) that contains C2PA metadata, then it displays a little "CR" icon that gives you additional information. However, the information it provides is both incorrect and unverified.

For example, I used Bing Image Creator to generate a picture of a goldfish.

According to Adobe's Content Credentials web service, it is an AI-generated image that was digitally signed by "Microsoft Corporation" on "Jun 7, 2024 at 9:37 AM MDT". It even reports that this picture was created using "Image Creator from [Microsoft] Designer". All of that is correct. (I didn't tamper with this picture.) If this picture had attribution information, such as naming a camera or photographer, then the web site would list those, too. (This is an AI-generated picture, so there is no photographer or camera.)

I then posted the same picture on LinkedIn. Currently, you have to be logged into LinkedIn to see the "CR" information. It looks like this:

LinkedIn is correct that "AI was used to generate part or all of this image." This declaration is because the metadata says it is an AI picture. LinkedIn is trusting that whoever generated this picture labeled it correctly.

LinkedIn says this came from "Microsoft Corporation", because that's what Microsoft self-declared in the unverified "Issuer" field.

However, LinkedIn shows the wrong date! LinkedIn says the picture is from "Feb 29, 2024" -- that's 99 days earlier! (As a reminder: I didn't tamper with this! This is backdated by LinkedIn!)

So what happed? LinkedIn has a bug (implementation defect). Rather than showing the date the file was created, they are showing the C2PA signing certificate's "Not Before" date. Using Hintfo, you can see that the certificate is valid from "Thu Feb 29 19:11:56 UTC 2024" to "Tue Dec 31 19:11:56 UTC 2024". Due to their programming error, they are showing the wrong date.

LinkedIn also has a few implementation oversights. For example:

Their implementation doesn't identify the photographer, camera, or anything else about media with C2PA metadata.

If the picture was partially altered with AI, then nothing in the C2PA manifest will identify exactly what part was altered. This is in direct contradiction to the C2PA announcement that claimed it can reveal "provenance information and any modifications made using AI."

Here's another example image that was posted on LinkedIn by Santiago Lyon (Adobe and Head of Advocacy and Education for the Content Authenticity Initiative; CAI):

His metadata didn't self-identify this as an AI-generated image, so there is nothing to indicate that it is AI-generated.

The signing certificate says it came from "Adobe Product Services G3". That's a generic certificate used to sign millions of files.

The date is not when the picture was captured or generated. That's the date when the signing certificate was first available for use. It must predate the image and likely predates it by almost a year.

Knowing who self-identified as the issuer, who self-declared that it is AI-generated, or what the certificate's "do not use before this date" tells you literally nothing about the authenticity and provenance of this picture. Moreover, the "Issuer" and same certificate were likely used to sign millions of pictures from different users. (As my coworker remarked: "This is like a store showing you the shirt label that only says 'Made in USA, wash cold, tumble dry' and not showing you anything else!" And remember: there are fraudulent and deceptive labels, so we need verification.)

The 800-Pound Gorilla

As a second test, I post-dated my sample Gorilla picture so that it appears to have been created in the future, on Oct 7, 2024:

According to Content Credentials, it's an AI picture from Microsoft that was created in the future, on "Oct 7, 2024 at 8:39 AM MDT". This date is not a bug; the implementation is doing what the spec says to do. This problem of displaying the fake date as legitimate is enabled by two different design defects:

C2PA doesn't validate the time stamp.
C2PA trusts that nobody altered the time stamp.

Fixing this problem requires a significant architectural redesign along with a code rewrite to fix both defects. Fixing "just one" of these two design flaws is not enough to prevent this type of forgery.

According to LinkedIn, this Gorilla picture was also created on "Feb 29, 2024". This is because of LinkedIn's date bug, showing the certificate's start date and not the signing date. But even if LinkedIn fixes their date bug, they would still show the altered date because C2PA cannot detect the alteration. (Fixing the "wrong date" bug, only to reveal the "trivial to forge date" architectural defect, is a wasted effort, like putting lipstick on a pig!)

Bearly Trustworthy

As a final test, I uploaded a complete forgery that uses a self-signed certificate:

Hintfo uses Adobe's 'c2patool' and GnuTLS's 'certtool' to evaluate the data. 'c2patool' says everything appears valid, while 'certtool' identifies the certificates as self-signed and untrustworthy.

Content Credentials issues a gray-box notice that the certificate is untrusted. (All self-signed certificates are untrusted by them.) However, it doesn't identify that any of the metadata is forged. If you didn't notice the little gray warning box at the top, then you wouldn't notice that the photographer's name, GPS, and other information is a forgery.

In contrast to these other services, LinkedIn says nothing. There is no "CR" logo, even though there is C2PA metadata. There is no indication that the media is a forgery, AI, or untrustworthy. This is an implementation oversight since it's not compliant with the C2PA specifications. (See the "C2PA User Experience Guidance for Implementers" documentation.)

In this case, LinkedIn has implementation defects and implementation oversights. But even if they fix these problems with patches, they will still have all of C2PA's fundamental architectural detects.

Independent Verification

LinkedIn re-encodes all files. This means that you cannot download the original (source) image from LinkedIn with the original metadata in order to independently validate it. (Feel free to try this: Upload a known picture with known metadata to LinkedIn, and then try to download the picture. The uploaded metadata will be removed from the downloaded file.) You cannot verify any of the "CR" findings on LinkedIn unless you have the pre-uploaded file.

For testing LinkedIn, I started with a known picture with known metadata (ground truth), and upload it to LinkedIn. Then I compared LinkedIn's claimed "CR" results against the expected results. That's what I did with the fish, gorilla, and bear pictures: I know the metadata and the expected results, and I compared it against the misinformation provided by LinkedIn.

However, this method only works with access to the known source material. Without the ability to download the unaltered source from LinkedIn and independently verify the results, you must trust LinkedIn's interpretation. In this case, we know that LinkedIn provides misleading information so we have no reason to believe that any other LinkedIn media with the "CR" feature is reliable.

Scope of the Problem

To summarized the problems covered in this blog, LinkedIn's CR feature provides:

Unverified "issue by" information.

Unverified information about whether a file is AI generated. It only appears if the provider of the image included it, and you don't know if they included the correct information.

Displays vague certificate information that has no forensic value.

Omits existing information, such as who claims the copyright or where the media was captured.

Displays the wrong "issued on" date. But even if they fix this, the date is unverified and not authenticated.

Fails to report cases where the C2PA metadata is known to be untrusted.

And prevents users from downloading the source media in order to independently validate the authenticity and provenance information.

LinkedIn's implementation of C2PA's Content Credentials has implementation defects and oversights on top of the existing fundamental flaws in the C2PA's overall architecture.

It should come as no surprise that LinkedIn would enable C2PA's Content Credentials. LinkedIn is owned by Microsoft and Microsoft is a founding member of C2PA. What is surprising is how inattentive they are to the code they released. I mean, the fundamental flaws in C2PA are one thing, but the LinkedIn-specific bugs really makes it appear as though nobody even thought about running this past a code review or having it tested by their quality assurance staff. Clearly someone should have noticed the wrong dates!

By itself, C2PA is untrustworthy because it is too easy to forge. But C2PA-on-LinkedIn introduces a whole new level of disinformation, with misleading time stamps, unspecified owners, and no indication that a picture is altered.

LinkedIn is not the only service rolling out C2PA with serious problems. However, no amount of patching will correct the fundamental design issues that permit "authenticated" forgeries.

For anyone who still thinks that the fundamental architectural flaws can be addressed with a patch: they are wrong. Any "fix" will require a radically different solution. If the folks at C2PA ever take this route, it will be a very different solution and incompatible with their current deployed code. Moreover, they are not discussing anything like this right now; they are continuing to promote, distribute, and deploy their known-flawed solution.

Security expert Marcus Hutchins recently remarked about a different security issue, where Microsoft was rolling out an extremely insecure feature. He concluded with a very apt generalization:

Tech is an endless cycle of companies rushing the most asinine and insecure technologies to market, then only seriously considering the advice of their security teams after everything goes wrong.

I fully agree. C2PA and LinkedIn is yet another example of this.

Read more about AI, Authentication, Forensics, FotoForensics, Image Analysis, Mass Media, Politics, Security | Comment (1) | Direct Link

Potato Potahto

Friday, 24 May 2024

For the last few years, I've been fielding a steady flow of questions at FotoForensics from students who want to create AI-based analyzers. Each time, they say they want to use deep-learning to determine if a picture is real or fake. Unfortunately, I tell each of them the same thing: photo analysis is an incredibly difficult topic. Outside of extremely niche uses, there will be little or no accuracy. While I wish them well, simply training on a bunch of ELA images, or parsing a lot of metadata, isn't going to result in a good detector. The main limiting factor is the context. You need to understand what is in the picture, how the various metadata fields relate to the content, and how they combine to impact impact things like the coloring and image compression. Current deep learning systems cannot do that type of cross-correlation today.

Many of the public AI chat bots have recently introduced the ability to upload pictures for analysis. People can ask questions like "Is this picture real?" or "What can you tell me about this photo?" The results are laughably inaccurate. Unfortunately, too many people are taking the AI's responses seriously.

Signature Recipe

I decided to take a closer look at these online analyzers from Google, Microsoft, etc. For testing, I collected a set of pictures that where I controlled the content and modifications; I know the ground truth for this data set. I also included some AI-generated picture. For example, I asked Google Gemini, "Can you draw me as a potato?" Gemini responded: "Sure, here you are as a potato!"

As one of my coworkers remarked, "The resemblance is uncanny!" We had a good laugh about it, until I started doing an evaluation of the image. The thing that really stood out as unexpected was the bright spot in the lower-right corner when viewing the JPEG compression map (error level analysis; ELA):

A close-up of the lower-right corner shows that the AI tried to generate an artist's signature!

I don't know what data Google trained their system on, but I am sure that it includes lots of artworks. Gemini appears to have tried to recreate some artist's signature on this AI-generated image. Most likely the signature is a combination of a couple of signatures, with one dominating the others. That's why it kind of looks like letters without spelling anything. Somewhere out there is a real signature that probably looks similar to this.

I asked Gemini about the signature, and the reply (if it were human) sounds totally suspect:

Neal: You drew a potato with a signature in the lower right corner. Who's signature is that?

Gemini: You're right, I apologize for the mistake. As a large language model, I can't actually draw and accidentally included a detail that isn't possible. There wouldn't be a signature on an image I generate.

Would you like me to try describing you as a potato again, without the signature?

Since the AI-generated potato appears to use someone's signature, I have to wonder what else about it isn't original? Is the potato image a variation of some other artist's photo or painting of a potato? Is the layout of the potato on a wooden desk with dark lighting copied from some other artist?

OpenAI was recently accused of using actress Scarlett Johansson's voice without permission. Specifically, the CEO of OpenAI had asked for permission, was explicitly told "no", and then used her voice anyway. While my potato picture isn't the same as Scarlett Johansson's voice, it demonstrates the same problem of AI companies appearing to use content without permission.

Mashed

I recently saw a video on YouTube where Nathan Freitas tried to evaluate an image using ChatGPT. In the video, the AI system lies about the image (saying there is no metadata) and reaches some questionable results after he pasted the metadata into the chat system for evaluation.

Freitas isn't the only person trying to use these sytems to evaluate pictures. I've had personal communications with people who have used other AI systems and wanted a second opinion. My second opion? Trust nothing that the AI system says. Even if the results are partially correct, they often skip over minor artifacts that result in significant changes to the interpretation.

To test how these AI systems evaluate content, I tried a similar experiment using Google Gemini and Bing Copilot (Bing Chat).

For starters, I uploaded the picture of my network brick and asked "What can you tell me about this picture?"

Bing Copilot did something really odd. First, it began writing a long reply and then erased it and said, "Sorry! That’s on me, I can’t give a response to that right now. What else can I help you with?" I repeated the question and it provided a detailed reply:

Certainly! The image you’ve shared depicts networking equipment, specifically a network switch and other related hardware. Here are the details:

Network Switch and Hardware:
The central focus of the image is a red brick-like object with Ethernet ports. This object is designed to resemble a brick but serves as networking hardware.

The brick-like device has two Ethernet ports, one of which is connected to a yellow Ethernet cable.

Adjacent to the brick-like object, there’s a blue network switch with multiple ports. Some of these ports are connected with white cables.

The network switch features labels and indicators, including port numbers and speed indicators.

To the left of the network switch, there’s another piece of equipment with colorful knobs and LEDs. However, due to the angle of the photo, it’s not entirely clear what this device is.

Additional Information:
The background of the image is dark, emphasizing the networking equipment.

An optical character recognition (OCR) snippet is visible in the lower part of the image, showing port speeds (1000M, 100M, 10M) and labels related to WAN and the Internet.
Please feel free to ask if you’d like more information or have any other questions! 😊

Part of Bing Copilot's description is accurate: the fake brick is designed to look like a brick, it has a yellow network connector, and it's adjacent to another blue switch. However, it was more wrong that right:

The brick does not have "two Ethernet ports"; it has 5 ports and 2 are in use.

The blue switch (well, technically a router) does not have multiple white cables. It only has one white cable. The other cables are blue and red. The gray router has multiple white cables.

There is nothing to the left that has knobs.

The background is dark because it's a computer cabinet. This was not intentionally done to emphasize the networking equipment, even if that was the final outcome. The emphasis comes from the network equipment being in focus.

Google Gemini had a similarly descriptive result:

The image you sent appears to show a network switch with a variety of ports and statuses. Here’s a breakdown of the image:

Ports 1-5 and 6-8: These are all standard RJ45 Ethernet ports, which are used to connect devices to the network switch using Ethernet cables.

Speed indicators: The text next to each port indicates the maximum speed that the port can support. For example, the text “100M” next to port 1 indicates that the port can support speeds of up to 100 megabits per second (Mbps).

Link and Activity indicators: The lights next to the ports may also indicate the status of the connection. A lit light might mean that the port is connected to a device and that data is flowing through the port.

RGB: It is unclear what the the text “RGB” refers to in this context.

N: It is unclear what the text “N” refers to in this context.

WAN: WAN stands for Wide Area Network. A WAN is a network that spans a large geographical area, such as a city or country. WANs are typically used to connect smaller networks, such as Local Area Networks (LANs).

Inter: It is unclear what the text “Inter” refers to in this context.
Overall, the network switch in the image appears to be a fairly basic device that is used to connect multiple devices to a network. The specific functionality of the switch will depend on the make and model of the device.

Gemini definitely did better with the text, even if it didn't understand it. I have tags on every cable describing what it's plugged into. For example, the red connector goes from a computer named "RGB" to the "WAN" and the white goes from the "WAN" to the "Internet" bridge. However, I don't fault Gemini for not knowing my labeling system.

Gemini had some great insights, such as decoding the lights on the routers with regards to network speed. However, it also made some big assumptions about these routers. For example, it assumed that they are standard RJ45 connectors. And the list of ports (1-5 and 6-8) are really ambiguous. Are they referring to one router (1-8) or some combination of the top, middle, or lower routers?

Potato-Headed

Many of my pictures were rejected by Google Gemini. Basically, if it looks like a person, then it removes the image and says "Sorry I can't help with that image. Try uploading another image or describing the image you tried to upload and I can help you that way." For example, I tried to upload a picture of my AI-generated baby gorilla picture, but it rejected the image; I assume that it looked too much like a person.

Gemini also had trouble when I uploaded the potato picture. Keep in mind, Google Gemini generated this picture a few minutes before I uploaded it back to Gemini:

Neal: Is this picture real?

Gemini: The image you sent me appears to be a hyperrealistic painting of a potato on a wooden table. While the image is very convincing, there are a few details that suggest it is not a real photograph.

The perfect lighting on the potato and the lack of shadows cast by the potato on the table are unlike what you would typically see in a photograph.

The absence of any imperfections on the potato skin or the wood grain of the table also suggests it may not be a real potato.
I can't definitively say the image is not real without more information about the source of the image. However, based on the details I can see, it is more likely a painting than a photograph. You can use a reverse image search to see if you can find other versions of the image online. This may help you determine where the image originated from and whether it is a painting or a photograph.

Sadly, Gemini cannot detect AI-generated images that come from Gemini. Moreover, Gemini thinks this Gemini-generated image is a painting. (And paintings are made by artists!)

What has eyes but can't see? A potato!

I asked the same question about the potato picture to Bing Copilot:

Neal: Is this picture real?

Copilot:
Analyzing the image: Privacy blur hides faces from Copilot

The image you’ve shared appears to be a digitally created illustration rather than a photograph. It features a potato-like object with a rough, brownish skin, resting on a wooden surface. Unfortunately, due to the blurred section obscuring part of the image, I cannot definitively determine its authenticity.

Cool -- Copilot thinks it's a digital creation. However, it reached that conclusion because of a blurred region in the middle of the image. Wait! There is no blurred region in the picture I uploaded! I asked Copilot about this:

Neal: What part of the selection is blurred?

Copilot: I apologize for any confusion. The blurred section in the image appears to be a large rectangle that obscures a significant portion of the object. Unfortunately, without further context, it’s challenging to determine the purpose or content of that obscured area.

At the beginning of Copilot's response, it mentioned (in light gray text) that Copilot blurs faces for privacy. I think Copilot's pre-processor automatically blurred the potato and Copilot's image analyzer didn't realize that.

Deep Fried

For a final test, I pasted some EXIF data into the chat and asked each system, "What can you tell me about this metadata?"

Make                            : Apple

Camera Model Name               : iPhone 12 mini

Orientation                     : Rotate 90 CW

X Resolution                    : 72

Y Resolution                    : 72

Resolution Unit                 : inches

Software                        : 17.5

Modify Date                     : 2022:05:20 14:43:00

Host Computer                   : iPhone 12 mini

Y Cb Cr Positioning             : Centered

Exposure Time                   : 1/60

F Number                        : 1.6

Exposure Program                : Program AE

ISO                             : 250

Exif Version                    : 0232

Date/Time Original              : 2022:05:20 14:43:00

Create Date                     : 2022:05:20 14:43:00

Offset Time                     : -06:00

Offset Time Original            : -06:00

Offset Time Digitized           : -06:00

Components Configuration        : Y, Cb, Cr, -

Shutter Speed Value             : 1/60

Aperture Value                  : 1.6

Brightness Value                : 1.452416421

Exposure Compensation           : 0

Metering Mode                   : Multi-segment

Flash                           : Off, Did not fire

Focal Length                    : 4.2 mm

Subject Area                    : 2009 1502 2208 1387

Sub Sec Time                    : 000

Sub Sec Time Original           : 000

Sub Sec Time Digitized          : 000

Flashpix Version                : 0100

Color Space                     : Uncalibrated

Exif Image Width                : 4032

Exif Image Height               : 3024

Sensing Method                  : One-chip color area

Scene Type                      : Directly photographed

Exposure Mode                   : Auto

White Balance                   : Auto

Focal Length In 35mm Format     : 26 mm

Scene Capture Type              : Standard

Lens Info                       : 1.549999952-4.2mm f/1.6-2.4

Lens Make                       : Apple

Lens Model                      : iPhone 12 mini back dual wide camera 4.2mm f/1.6

Composite Image                 : General Composite Image

GPS Latitude Ref                : North

GPS Longitude Ref               : West

GPS Time Stamp                  : 20:59:54

GPS Date Stamp                  : 2024:05:20

GPS Horizontal Positioning Error: 1 m

Compression                     : JPEG (old-style)

Thumbnail Offset                : 2840

Thumbnail Length                : 2616

Thumbnail Image                 : (Binary data 2616 bytes, use -b option to extract)

Google Gemini gave a superficial answer:

The metadata you provided reveals a lot of details about a photo taken with an iPhone 12 mini. Here's a breakdown of the interesting bits:

Device Information:

Make: Apple

Camera Model Name: iPhone 12 mini
Image Properties:
Orientation: Needs to be rotated 90 degrees clockwise for proper viewing.

Resolution: 72 pixels per inch (not very high resolution, possibly a screenshot or resized image).

Software Version: 17.5 (likely the iOS version on the iPhone).

Date Taken: May 20, 2022, at 2:43 PM MDT (with offset information included).
Camera Settings:
Exposure Time: 1/60th of a second (relatively fast shutter speed).

F Number: 1.6 (wide aperture, good for low light).

ISO: 250 (relatively low ISO, indicates good lighting conditions).

Exposure Program: Program AE (camera determined the settings automatically).

Focal Length: 4.2 mm (wide angle lens).
Additional Details:
Flash: Not fired.

Metering Mode: Multi-segment (camera analyzes different parts of the scene to set exposure).

Color Space: Uncalibrated (colors may not be entirely accurate).

Original Image Size: 4032 pixels wide by 3024 pixels tall (much higher resolution than displayed).

GPS: Not available (although some reference points exist, there's no data).

Image Compression: JPEG (common image format).
Overall, this metadata provides a good technical picture of the photo. It seems like a photo taken with an iPhone 12 mini in good lighting conditions, possibly indoors. The low resolution (72 ppi) suggests it might be a screenshot or a resized image.

Some of the information is pretty straightforward, such as the stated camera make and model. But other findings are wrong or misleading. For example:

The "72 pixels per inch" has absolutely nothing to do with whether it is a photo or screenshot. (All iPhones use 72 ppi resolution.)

If it's going to determine that the make is "Apple" because it says "Apple" (taking metadata at face value), then it should assume that the software version (17.5) is the software version. Why does it think the software version is "likely" when the make is absolute?

Gemini equated the ISO level with the lighting conditions. Nope -- ISO refers to the sensor sensitivity. The auto-exposure program combines ISO with exposure and F-number based on the lighting conditions. An expert can make assumptions about the lighting conditions based on these combined values, but not based on the ISO alone. If this result came from a person, I'd assume that they skipped the part about how camera sensors work.

Gemini says that the image size, 4032x3024, is "much higher resolution than displayed". Really? Gemini couldn't evaluate the image's metadata when I uploaded the image, so I started a new chat and only pasted the metadata. Gemini never displayed a picture. In addition, the EXIF image size has nothing to do with the displayed resolution. (I have plenty of sample pictures where the JPEG dimensions differs from the EXIF dimensions.)

This is far from everything that Gemini missed. With a deeper analysis of just this metadata, an expert can tell that the metadata in this example was intentionally altered.

Bing Copilot performed even worse. With the same EXIF data, it simply recited back the metadata values. (But it did show me a row of ads for iPhones!)

Copilot: Certainly! Let’s dive into the metadata from your iPhone 12 mini photo. Here are some key details:

Make: Apple

Camera Model Name: iPhone 12 mini

Orientation: Rotated 90 degrees clockwise

Resolution: X and Y resolutions are both 72 pixels per inch (PPI)
...

Thumbnail Length: 2616 bytes
Feel free to ask if you’d like more information about any specific aspect!

Neal: Is this picture altered?

Copilot: The metadata you provided doesn’t explicitly indicate whether the picture has been altered. However, based on the information available, we can’t definitively determine if any post-processing or alterations were applied. If you have any specific concerns or questions about the image, feel free to share more details, and I’ll do my best to assist!

Nope -- the metadata definitely denotes alterations.

With any kind of forensic analysis, reliability and repeatability are critical. I pasted the same metadata into both Google Chat and Bing Copilot multiple times. Both systems returned different results each time. Moreover, the analysis and incorrect portions changed each time. Their results are inaccurate, unrepeatable, unreliable, and undependable.

Meat and Potatoes

There are different types of AI systems. Large language models (LLMs) attempt to generalize language constructs based on a massive amount of input text. In contrast, deep neural networks (DNNs) are often used for image (or voice) generation and object detection.

Bing Copilot, Google Gemini, and OpenAI's ChatGPT combine LLM and DNN components. When you type a text query into any of these systems, the LLM tries to interpret the query and reply. When you ask it to generate a picture or to describe the contents of a picture, then it uses the LLM to decipher the inquiry and the DNN to generate or classify the visual content.

These LLM+DNN systems do serve legitimate purposes. For example, they can spur brainstorming and help with the creative process. However, they are not a panacea for every AI problem. Even the best of these systems do not reply with authoritative responses. (They may use an authoritative writing style, but all results must be fact-checked. And fact-checking does not mean 'compare against another LLM.')

LLMs and DNNs are very good at memorization and recall with variation. Unfortunately, they don't check their work. As a result, they can spout false statements without realizing it, mismatch factual statements in a way that is inaccurate, and replicate an artist's style without attribution.

The Arran Pilot Potato

Forensics are used to evaluate evidence for use in a court of law. Because of this requirement for factual and accurate results, my FotoForensics services do not use any LLM or DNN systems; those technologies are not reliable enough. This doesn't mean that I don't use AI. In fact, it's quite the contrary. However, I rely on other types of AI, such as expert systems and Bayesian networks, where the results are deterministic, repeatable, and provable.

Even something like metadata analysis is a good fit for some kinds of AI.

The commercial FotoForensics service (not on the public site) includes an expert system that evaluates metadata. Even without visual content, it can reliably detect inconsistencies, alterations, and information about the content. As an expert system, it is literally designed using knowledge from human experts. While I know all of the little tips and clues to look for in the metadata, the expert system is faster than me and won't accidentally overlook some subtle clue.

The LLMs and DNNs used by Google, Microsoft, and OpenAI were never trained by subject matter experts. Unfortunately, their developers appear to be putting more effort into expanding their LLM and DNN services than fact-checking the generated content. As a result, any metadata analysis appears as a regurgitation and often includes factually inaccurate responses. In effect, they are giving bogus analysis results to users who have legitimate needs. Any results that appear correct could be correct by coincidence. (Try the same query again! The inherent "create a variation" in LLMs will likely introduce other errors in repeat analysis attempts.)

With photo analysis, it's never a "build once and done" situation. Each new camera and smartphone OS update may introduce new artifacts that require changes to the analyzer. A human maintaining an expert system may require a few hours a months to revise the system based on new findings. An equivalent LLM system would demand much more time and effort to revise and validate, assuming that the LLM developers do any validation.

I keep encountering people who use these LLM+DNN systems to perform photo analysis. Because Google, Microsoft, and OpenAI don't actively warn users about any inaccuracies, people are often left with the wrong impression; they may think a real photo is altered or an altered photo is real. Until these companies demonstrate that their AI results are reliable and accurate, these LLM tools should never be used to generate any kind of forensic evaluation.

It isn't that these LLM+DNN systems can't eventually become great automated detection systems. The problem is that we're nowhere near that technological level. (My gut tells me that we're more than a decade away.) Remember when you were a child and pretended that a cardboard box was an airplane or a spaceship? These LLM systems are at that level of technology right now. No matter how nicely you decorate it or how accurately you reproduce the controls, it still doesn't fly. But if you use your imagination and suspend disbelief, then these automated results from Google, Microsoft, and OpenAI look pretty good.

Read more about AI, Forensics, FotoForensics, Image Analysis, Programming | Comments (2) | Direct Link

(Page 1 of 206, totaling 1027 entries) next page »

About

Popular Posts

Tools

Links

Calendar

Archives

Feeds

Categories