AngleSharp is a .NET library, not a PowerShell cmdlet or module, for parsing webpages. It can be used to extract information from a page, such as its title or the contents of specific elements. To use AngleSharp in PowerShell 7, you first install the package from NuGet with Install-Package and then load its DLL into your session with Add-Type. Once the library is loaded, you can create a parser, parse the HTML into a document, and query that document through properties such as Title and All or methods such as QuerySelectorAll.
AngleSharp is a .NET library that makes parsing and working with HTML content quick and easy. Because AngleSharp is a .NET library, you can load it in PowerShell and consume its output there as well. Combining the two lets you script against HTML content with very little code. In this article, we will set up AngleSharp, retrieve a weather page, and convert the data into a PowerShell object.
Installing and Loading AngleSharp
Installing AngleSharp is easy using the Install-Package command. You can even install the package into the CurrentUser scope, which means you do not need administrative rights to use this library. The package is hosted in the NuGet Gallery.
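A minimal sketch of the install step follows. It assumes the NuGet package provider is available in your session and uses the public nuget.org v2 feed as the source. The feed URL is an assumption rather than something this article specifies, so substitute whichever NuGet source you have registered.

```powershell
# Install AngleSharp for the current user only, so no elevation is required.
# Assumption: the nuget.org v2 feed is reachable and acceptable as a package source.
Install-Package -Name 'AngleSharp' -ProviderName NuGet -Scope CurrentUser -Source 'https://www.nuget.org/api/v2'
```

PowerShell may ask you to confirm that you trust the source before the package is installed.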
Next, we need to load AngleSharp for use in our PowerShell script. To do this, we use the Add-Type cmdlet to load the library's DLL directly. One approach is to locate the DLL for the latest .NET target inside the installed package and load that path, but only if the library isn't already loaded in your session.
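The loading step might look like the sketch below. It assumes AngleSharp 0.10 or later (where the parser type is AngleSharp.Html.Parser.HtmlParser), that the package was installed with Install-Package so Get-Package can report its location, and that the package ships a netstandard build, which is a conservative choice for PowerShell 7. The variable names are illustrative.

```powershell
# Skip loading when the parser type already resolves in this session.
# Assumption: AngleSharp 0.10 or later, where the parser lives in AngleSharp.Html.Parser.
if (-not ('AngleSharp.Html.Parser.HtmlParser' -as [type])) {
    # Get-Package reports the path of the installed .nupkg; its folder holds the lib directories.
    $Package = Get-Package -Name 'AngleSharp' -ErrorAction SilentlyContinue
    if ($Package) {
        $PackageRoot = Split-Path -Path $Package.Source -Parent
        # Assumption: the package ships netstandard builds; sorting by path picks the last one.
        $DllPath = Get-ChildItem -Path $PackageRoot -Filter 'AngleSharp.dll' -Recurse |
            Where-Object FullName -Like '*netstandard*' |
            Sort-Object FullName |
            Select-Object -Last 1 -ExpandProperty FullName
        if ($DllPath) {
            Add-Type -Path $DllPath
        }
        else {
            Write-Warning 'No netstandard build of AngleSharp.dll was found in the package.'
        }
    }
    else {
        Write-Warning 'AngleSharp package not found; install it before loading.'
    }
}
```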
Parsing a Webpage
Of course, the whole point of this is to actually parse a web page. In this example we will retrieve the page with Invoke-WebRequest and then parse the result with AngleSharp. We are going to use a local 7-day forecast from the National Weather Service to pull in weather data and convert it to an object. First, let's retrieve the weather data.
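As a sketch, the request could look like this. The coordinates in the URL are illustrative assumptions, not the location used in this article, and $Uri and $Request are names chosen here for the example.

```powershell
# Assumption: illustrative coordinates; replace them with the location you want a forecast for.
$Uri = 'https://forecast.weather.gov/MapClick.php?lat=41.8781&lon=-87.6298'
$Request = Invoke-WebRequest -Uri $Uri

# The raw HTML source is exposed as a single string in the Content property.
$Request.Content.Length
```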
The data for the site that we are interested in is in the Content property, but it is the full HTML source, which is a lot to process. Often, it is easiest to use Chrome Developer Tools to locate the section of the HTML source that we want to use (F12 in Chrome for the site you want to inspect).
The HTML structure of the NWS weather page.
Thankfully, there is a div container with an unordered list that we can parse. The next step is to actually load the retrieved content into AngleSharp.
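A minimal sketch of the parsing step, assuming AngleSharp 0.10 or later is already loaded and that the Invoke-WebRequest result is stored in a variable named $Request (a name chosen for this example):

```powershell
# Assumption: AngleSharp 0.10 or later is loaded and $Request holds the Invoke-WebRequest result.
$Parser = [AngleSharp.Html.Parser.HtmlParser]::new()
$Parsed = $Parser.ParseDocument($Request.Content)

# Quick sanity check: the parsed document exposes the page title.
$Parsed.Title
```

Older AngleSharp releases use a different namespace and method name for the parser, so adjust the type name if your installed version differs.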
Now that we have the parsed content available in our $Parsed variable, we can start to narrow this data down to just the section we want. Conveniently, the NWS site gives this unordered list its own ID, seven-day-forecast-list. Since an ID is supposed to be unique within an HTML page, this makes the list easy to target. Using the All property on our parsed content, we can filter down to just the element with the ID of seven-day-forecast-list.
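The filter can be sketched as follows, assuming $Parsed holds the document returned by the parser. $ForecastList is a name chosen for this example.

```powershell
# Assumption: $Parsed holds the document returned by the AngleSharp parser.
# Filter every element in the document down to the one whose Id matches.
$ForecastList = $Parsed.All | Where-Object Id -EQ 'seven-day-forecast-list'

# Count the nodes beneath the list as a quick check that the element was found.
$ForecastList.ChildNodes.Length
```

The document object also offers a GetElementById method, so $Parsed.GetElementById('seven-day-forecast-list') should return the same element without enumerating every node.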
This will result in a lot of different properties, but we are focused on the ChildNodes property as it will contain each li containing the data we need. To get an idea of what we are looking to target in our object, let’s take a look at an individual li. There are a handful of elements with classes that we can target.
period-name – The relative time period.
short-desc – A condensed description of the weather.
temp temp-high – The high temperature.
HTML structure of a single tombstone-container element.
You may notice that the img tag contains an alt attribute with a lot of useful information. Finding the class to target is straightforward, as it is stored in the ClassName property of each element. To reach the alt attribute we will rely on a slightly different method, QuerySelectorAll, which uses standard CSS selectors to make complex targeting easy.
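Putting these pieces together, one possible sketch is shown below. ConvertTo-ForecastObject is a hypothetical helper written for this example, the output property names (Period, Description, Temperature, Detail) are arbitrary choices, and $ForecastList is assumed to hold the seven-day-forecast-list element. The class names come from the list above; matching on 'temp*' is an assumption intended to catch low temperatures as well as highs.

```powershell
# Hypothetical helper: turns one forecast <li> element into a PowerShell object.
function ConvertTo-ForecastObject {
    param(
        [Parameter(Mandatory)]
        $ListItem
    )

    # Collect every descendant element so each can be filtered by its ClassName.
    $Elements = $ListItem.QuerySelectorAll('*')

    [PSCustomObject]@{
        Period      = ($Elements | Where-Object ClassName -EQ 'period-name').TextContent
        Description = ($Elements | Where-Object ClassName -EQ 'short-desc').TextContent
        # Assumption: temperature classes all start with 'temp' (for example 'temp temp-high').
        Temperature = ($Elements | Where-Object ClassName -Like 'temp*').TextContent
        # The alt attribute is reached through a CSS selector rather than a class name.
        Detail      = ($ListItem.QuerySelectorAll('img') | ForEach-Object { $_.GetAttribute('alt') })
    }
}

# Assumption: $ForecastList holds the seven-day-forecast-list element.
# ChildNodes can include whitespace text nodes, so keep only the <li> elements.
$Forecast = $ForecastList.ChildNodes |
    Where-Object NodeName -EQ 'LI' |
    ForEach-Object { ConvertTo-ForecastObject -ListItem $_ }

$Forecast | Format-Table -AutoSize
```

Wrapping the extraction in a function keeps the per-item logic in one place, which makes it easier to adjust if the page markup changes.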
Output of the parsed web page from AngleSharp.
Although we have to iterate over a few elements to ultimately get to just the ones we want, we can walk through the HTML document structure and get to just what we need. It can be a bit tricky to understand the structures, but ultimately what AngleSharp is doing is creating objects for each DOM element. Once you figure out the best way to target the elements you need, extracting the content is not difficult.
Conclusion
AngleSharp offers an excellent programmatic interface for parsing and interacting with HTML content on webpages. This can open the door to using PowerShell to retrieve content that may otherwise be inaccessible. Being able to take this content, store it, and use it in scripts is extremely useful and can help when integrating systems that expose their data only as web pages!