I'm looking to download the URLs of the images listed on the Media tab of Firefox's View Page Info dialog, preferably using PowerShell. I'm not sure whether this is possible or whether there is a better way to do it.
2 Answers
Some websites, or parts of them, block or disallow automation, and there is nothing you can do about that.
By the way, you don't need a browser to download website data. This is known as web scraping, and in PowerShell it is done with the web cmdlets, specifically Invoke-WebRequest:
# Get specifics for a module, cmdlet, or function
(Get-Command -Name Invoke-WebRequest).Parameters
(Get-Command -Name Invoke-WebRequest).Parameters.Keys
<#
# Results
UseBasicParsing
Uri
WebSession
SessionVariable
Credential
UseDefaultCredentials
CertificateThumbprint
Certificate
UserAgent
DisableKeepAlive
TimeoutSec
Headers
MaximumRedirection
Method
Proxy
ProxyCredential
ProxyUseDefaultCredentials
Body
ContentType
TransferEncoding
InFile
OutFile
PassThru
Verbose
Debug
ErrorAction
WarningAction
InformationAction
ErrorVariable
WarningVariable
InformationVariable
OutVariable
OutBuffer
PipelineVariable
#>
Get-help -Name Invoke-WebRequest -Examples
<#
# Results
$R = Invoke-WebRequest -URI
$R.AllElements | where {$_.innerhtml -like "*=*"} | Sort {
values. Sorting by the shortest HTML value often helps you find the
$R=Invoke-WebRequest http://www.facebook.com/login.php
$FB
$Form = $R.Forms[0]
$Form | Format-List
$Form.fields
$Form.Fields["email"]="[email protected]"
$R=Invoke-WebRequest -Uri ("https://www.facebook.com" +
# Sends a sign-in request by running the Invoke-WebRequest
$R.StatusDescription
(Invoke-WebRequest -Uri "http://msdn.microsoft.com/en-us/library
#>
Get-help -Name Invoke-WebRequest -Full
Get-help -Name Invoke-WebRequest -Online
So, for the URL you say you are trying to hit, note that you get results for ...
# Download website main page (note: the URI below is instantcart.com, not instacart.com)
($InstacartHomeData = Invoke-WebRequest -Uri 'https://www.instantcart.com')
<#
# Results
StatusCode : 200
StatusDescription : OK
Content : <!DOCTYPE html><html lang="en" class="no-js"><head><link rel="alternate"
href="https://www.instantcart.com/" hreflang="en-gb" /><link rel="alternate"
href="https://www.instantcart.com/" hreflang="en" ...
RawContent : HTTP/1.1 200 OK
Pragma: no-cache
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
Content-Type: text/...
Forms : {}
Headers : {[Pragma, no-cache], [Vary, Accept-Encoding], [Connection, close], [Transfer-Encoding, chunked],
[Cache-Control, private, no-cache, no-store, proxy-revalidate, no-transform], [Content-Type,
text/html], [Date, Thu, 28 May 2020 04:23:28 GMT], [Expires, Thu, 19 Nov 1981 08:52:00 GMT],
[Set-Cookie, sid=b806f71e100b9f2d4d1037561b53ff65; path=/; domain=www.instantcart.com], [Server,
Apache], [X-Powered-By, PHP/5.5.38]}
Images : {@{innerHTML=; innerText=; outerHTML=<img width="160" class="img-responsive"
...
#>
# Get only images data
$InstacartHomeData.Images | Select-Object alt, src
<#
# Results
alt src
--- ---
/pics/logo.png
Abode Home Products /images/home/clients/abode-home-products.png
Avanta UK /images/home/clients/avanta-uk.png
Q-Park /images/home/clients/qpark.png
...
#>
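Several of the src values in the results above are site-relative paths (for example /pics/logo.png), so they are not yet usable as download URLs. A minimal sketch of one way to resolve them against the page address follows; the function name Resolve-ImageUrl and its parameters are hypothetical, not part of any module, and it relies only on the .NET System.Uri class.

```powershell
# Hypothetical helper (not part of the original answer): resolve the src values
# returned by .Images against the address of the page they came from.
function Resolve-ImageUrl {
    param(
        [Parameter(Mandatory)] [string]   $BaseUri,
        [Parameter(Mandatory)] [string[]] $Src
    )
    $base = [System.Uri]$BaseUri
    foreach ($s in $Src) {
        # Uri(Uri, string) resolves a relative reference against the base;
        # a src that is already absolute passes through unchanged.
        ([System.Uri]::new($base, $s)).AbsoluteUri
    }
}

# Example using two of the src values shown in the results above
Resolve-ImageUrl -BaseUri 'https://www.instantcart.com' -Src '/pics/logo.png', '/images/home/clients/qpark.png'
```

In practice the -Src argument would be something like $InstacartHomeData.Images.src, after filtering out any empty values.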
Now, make the same attempt for your target page.
# Download the specific product page
($InstacartProductPageData = Invoke-WebRequest -Uri 'https://www.instacart.com/products/98954-poland-spring-natural-spring-water-2-5-gal')
<#
# Results
# Cookies are used to get this
StatusCode : 200
StatusDescription : OK
Content : <!DOCTYPE html>
<html lang='en'>
<head>
<title>
Poland Spring Natural Spring Water (2.5 gal) - Instacart
</title>
<meta content='Buy Poland Spring Natural Spring Water (2.5 gal) online and have it de...
RawContent : HTTP/1.1 200 OK
Transfer-Encoding: chunked
Connection: keep-alive
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Permit...
Forms : {}
Headers : {[Transfer-Encoding, chunked], [Connection, keep-alive], [X-Frame-Options, SAMEORIGIN],
[X-XSS-Protection, 1; mode=block]...}
Images : {@{innerHTML=; innerText=; outerHTML=<img class="rmq-569a8dd6" style="background: rgb(255, 255,
...
Poland Spring 100% Natural Spring Water
2.5 gal; outerHTML=<a style="text-decoration: none;"
href="/products/16965376-poland-spring-100-natural-spring-water-2-5-gal" data-radium="true"><div
class="rmq-cd8b1370 rmq-5e34cd3" style="padding: 0px 16px; width: 208px; height: 100%; text-align:
left; line-height: 1.29; font-size: 14px; display: flex; position: relative; opacity: 1;
flex-direction: column;" data-radium="true"><div class="rmq-24058c4e" style="width: 176px; height:
176px;" data-radium="true"><img style="width: 100%; display: block;" alt="" src="https://d2d8wwwkmh
fcva.cloudfront.net/352x/d1s8987jlndkbs.cloudfront.net/assets/missing-item-4bbe82b8555e4d1c12626fd4
82cb2409713e8e30835645ff3650ef66a725d03c.png" data-radium="true"></div><div style="padding-bottom:
8px; margin-top: auto;" data-radium="true"><div class="rmq-50e196af" style="color: rgb(66, 66,
66); overflow: hidden; margin-top: 20px; -ms-text-overflow: ellipsis; max-height: 55px;"
data-radium="true">Poland Spring 100% Natural Spring Water</div><div style="color: rgb(117, 117,
117);" data-radium="true"><span>2.5 gal</span></div></div></div></a>; outerText=
...
#>
# Get only images data
$InstacartProductPageData.Images | Select-Object alt, src
<#
# Results
alt src
--- ---
Instacart logo https://d2guulkeunn7d8.cloudfront.net/assets/beetstrap/brand/carrotlogo-p...
Poland Spring Natural Spring Water https://d2lnr5mha7bycj.cloudfront.net/product-image/file/large_f44f2f09-b...
Gala Fresh logo https://d2lnr5mha7bycj.cloudfront.net/warehouse/logo/162/0f5c96be-4126-45...
...
#>
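Since the goal in the question is a list of image URLs, the src column can be reduced to unique, non-empty values and written to a text file. The sketch below is one possible way to do that; Export-ImageUrlList is a hypothetical name, and the sample objects stand in for the real .Images collection so the block runs without a network call.

```powershell
# Hypothetical helper: collect unique, non-empty src values and save them to a text file.
function Export-ImageUrlList {
    param(
        [Parameter(Mandatory)] [object[]] $Images,
        [Parameter(Mandatory)] [string]   $Path
    )
    $urls = $Images |
        ForEach-Object { $_.src } |   # keep only the src property
        Where-Object { $_ } |         # drop empty or missing values
        Select-Object -Unique         # remove duplicates
    $urls | Set-Content -Path $Path
    $urls                             # also return the list to the caller
}

# Stand-in objects (assumed data); in practice pass $InstacartProductPageData.Images instead.
$sample = @(
    [pscustomobject]@{ alt = 'logo'; src = 'https://example.com/logo.png' }
    [pscustomobject]@{ alt = '';     src = 'https://example.com/logo.png' }
    [pscustomobject]@{ alt = 'none'; src = '' }
)
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'image-urls.txt'
Export-ImageUrlList -Images $sample -Path $outFile
```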
The script below uses Internet Explorer to render the page; the image locations are then available through the Document property.
Adjust the output path and the website to what you need.
I have not tested whether the results match what Firefox lists, but they are very likely to be the same.
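$OutputFile = "c:\test\images.txt" # change this to the output folder and file name; ensure it ends with .txt
$WebPage = "https://www.somewebsite.com" # change this to the web page you want
# Start a hidden Internet Explorer instance and load the page
$ieObject = New-Object -ComObject 'InternetExplorer.Application'
$ieObject.Visible = $false
$ieObject.Navigate($WebPage)
# Wait until the page has finished loading (ReadyState 4 = complete)
while ($ieObject.Busy -or $ieObject.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
# Collect the src of every image in the rendered document and save the list
$images = $ieObject.Document.images | ForEach-Object { $_.src }
$images | Out-File $OutputFile
$ieObject.Quit()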
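The wait loop in the script above polls ReadyState indefinitely, so it never returns if a page does not reach the complete state. A minimal sketch of a polling helper with a timeout is shown here; Wait-Until and its parameters are hypothetical names, and the demonstration uses a timer condition rather than a real browser object.

```powershell
# Hypothetical helper: poll a condition until it is true or the timeout expires.
# Returns $true if the condition was met, $false if the timeout was reached first.
function Wait-Until {
    param(
        [Parameter(Mandatory)] [scriptblock] $Condition,
        [int] $TimeoutSec = 30,
        [int] $IntervalMs = 100
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSec)
    while (-not (& $Condition)) {
        if ((Get-Date) -gt $deadline) { return $false }
        Start-Sleep -Milliseconds $IntervalMs
    }
    return $true
}

# With the IE object from the script above, this might be called as:
#   Wait-Until -Condition { $ieObject.ReadyState -eq 4 } -TimeoutSec 60
# Self-contained demonstration: the condition becomes true after about half a second.
$start = Get-Date
Wait-Until -Condition { ((Get-Date) - $start).TotalMilliseconds -ge 500 } -TimeoutSec 5
```

Checking the return value lets the script report a page that did not finish loading instead of hanging.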
This works very well on most sites; however, I am having issues with it on this site: instacart.com/products/… I'm not sure what would cause it not to work. Commented May 27, 2020 at 18:26