There are different approaches to scraping data from a web site automatically. Saving the pages and analyzing them locally is one such approach. Power Automate Desktop has a very useful feature for this inside the “Get details of web page” action: Get web page source.

Combined with the “Write text to file” action, it allows you to save web pages locally for further processing. Here is a schematic overview of how a flow and its subflows could be structured for such a task.
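If you want to prototype the same idea outside Power Automate Desktop, a minimal Python sketch of what “Get web page source” plus “Write text to file” accomplish could look like this (the URL and file name are placeholders):

```python
import urllib.request

# Placeholder URL and output file name; in the flow these come from the
# URL list and the working directory.
url = "https://example.com/products"
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

with open("page_001.html", "w", encoding="utf-8") as f:
    f.write(html)
```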
Main
Creates the working directory for the files to be kept (e.g. based on the date)
Contains a flat list of “Run subflow” actions:
- Run subflow SavePages (Running time: 3:00 min for 25 pages)
- Run subflow Save2ndPages (Running time: 2:30 min for 20 pages)
- Run subflow Pages2TSV (Running time: A couple of seconds)
- Run subflow FilterTSV
- Run subflow ProductTSV
- Run subflow TSV2XLSX
- Run subflow FilterTSV2XLSX
- Run subflow ProductsTSV2XLSX
Each subflow depends on data created by a previous subflow, but it should still be possible to run each one individually during development and debugging to save time. This can easily be achieved by disabling/enabling the “Run subflow” actions here (e.g. to debug Pages2TSV you don’t have to wait 5:30 minutes every time, because the subflow works on the data that was already created locally in an earlier run).
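As a rough illustration (not Power Automate Desktop itself), the structure of the main flow corresponds to something like the following Python sketch; the step functions here are just print placeholders standing in for the subflows:

```python
from datetime import date
from pathlib import Path

# Placeholder step functions standing in for the subflows. In the real flow
# each step reads and writes files in the working directory, which is what
# makes it possible to re-run a single step on its own.
def save_pages(workdir: Path): print("SavePages ->", workdir)
def pages_to_tsv(workdir: Path): print("Pages2TSV ->", workdir)
def tsv_to_xlsx(workdir: Path): print("TSV2XLSX ->", workdir)

def main():
    # Working directory named after today's date, e.g. "2025-01-31"
    workdir = Path(date.today().isoformat())
    workdir.mkdir(exist_ok=True)

    # Comment out steps while debugging, just like disabling the
    # corresponding "Run subflow" actions in the main flow.
    save_pages(workdir)      # ~3:00 min for 25 pages
    pages_to_tsv(workdir)    # a couple of seconds
    tsv_to_xlsx(workdir)

if __name__ == "__main__":
    main()
```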
SavePages
- Input: A text file with the URLs of the web pages to scrape
- Output: The respective web pages as .html files in the working directory plus a second text file if secondary pages exist that need to be scraped
The subflow uses a browser instance, which is closed at the end of the subflow. If the website loads content dynamically, cursor-down keystrokes can be sent to the page to ensure all elements are loaded. If the number of elements on the web page is dynamic and determines whether there is a second page, this is analyzed in this subflow, too.
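A minimal Python sketch of the SavePages idea, assuming a urls.txt input file and a hypothetical pagination check; note that a plain HTTP request does not execute JavaScript, so the dynamic-loading part (browser instance plus cursor-down keystrokes) is specific to the Power Automate Desktop approach:

```python
import urllib.request
from pathlib import Path

def save_pages(workdir: Path, url_file: str = "urls.txt") -> None:
    second_pages = []
    urls = [u.strip() for u in Path(url_file).read_text(encoding="utf-8").splitlines() if u.strip()]
    for i, url in enumerate(urls, start=1):
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        (workdir / f"page_{i:03d}.html").write_text(html, encoding="utf-8")

        # Hypothetical check: if the page links to a next result page,
        # remember a URL for the Save2ndPages step.
        if 'rel="next"' in html:
            second_pages.append(url + "?page=2")

    (workdir / "urls_2nd.txt").write_text("\n".join(second_pages), encoding="utf-8")
```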
Save2ndPages
- Input: A text file with the additional URLs of the web pages to scrape
- Output: The respective web pages as .html files in the working directory
Pages2TSV
- Input: The working directory with the saved .html files
- Output: A .tsv (tab separated values) file with the scraped content of the web pages
This is where the magic happens, and this part will be different for every web site. The page source will contain certain identifiers around each piece of information you want to extract, e.g. >#1< >#2< >#3<, so with the help of the “Parse text”, “Crop text” and “Get subtext” actions and some loops, the relevant information is fetched and written into the .tsv file.
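Since the marker logic depends entirely on the site, the following is only an illustrative Python sketch of the pattern: a small helper that mimics “Crop text”/“Get subtext” by cutting out the text between two markers, applied to every saved page and written as one row per page (the markers and column names are assumptions):

```python
from pathlib import Path

def between(text: str, start: str, end: str) -> str:
    # Return the substring between two markers, similar to "Crop text".
    i = text.find(start)
    if i == -1:
        return ""
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else ""

def pages_to_tsv(workdir: Path, out_name: str = "items.tsv") -> None:
    rows = ["name\tprice\tvendor"]  # assumed columns
    for html_file in sorted(workdir.glob("*.html")):
        html = html_file.read_text(encoding="utf-8", errors="replace")
        # ">#1<", ">#2<", ">#3<" stand for whatever site-specific markers
        # surround the pieces of information you want to extract.
        name = between(html, ">#1<", "<")
        price = between(html, ">#2<", "<")
        vendor = between(html, ">#3<", "<")
        rows.append(f"{name}\t{price}\t{vendor}")
    (workdir / out_name).write_text("\n".join(rows), encoding="utf-8")
```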
FilterTSV
- Input: The .tsv file from the subflow before, a .txt file with keywords to filter for
- Output: A smaller .tsv file
This is an optional subflow that helps to condense the data created before into a smaller file with only the relevant items.
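A Python sketch of the keyword filter, assuming the file names items.tsv and keywords.txt and a simple case-insensitive match anywhere in the row:

```python
from pathlib import Path

def filter_tsv(workdir: Path, keywords_file: str = "keywords.txt") -> None:
    keywords = [k.strip().lower()
                for k in Path(keywords_file).read_text(encoding="utf-8").splitlines()
                if k.strip()]
    lines = (workdir / "items.tsv").read_text(encoding="utf-8").splitlines()
    header, rows = lines[0], lines[1:]
    # Keep a row if it contains at least one of the keywords.
    kept = [r for r in rows if any(k in r.lower() for k in keywords)]
    (workdir / "items_filtered.tsv").write_text("\n".join([header] + kept), encoding="utf-8")
```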
ProductTSV
- Input: The big .tsv file
- Output: A smaller .tsv file
This is an optional subflow that filters the data down to one specific vendor and creates a flat list of this vendor’s products.
TSV2XLSX
- Input: The big .tsv file
- Output: A nicely formatted and named Excel sheet
This takes the data from the big .tsv file, creates an Excel workbook with several worksheets and puts the data nicely inside.
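In Python this step could be sketched with the third-party openpyxl package (in the flow itself Power Automate Desktop’s Excel actions do this work); the file and worksheet names are assumptions:

```python
from pathlib import Path
from openpyxl import Workbook
from openpyxl.styles import Font

def tsv_to_xlsx(workdir: Path, tsv_name: str = "items.tsv", xlsx_name: str = "items.xlsx") -> None:
    wb = Workbook()
    ws = wb.active
    ws.title = "Items"  # further worksheets could be added with wb.create_sheet()
    for line in (workdir / tsv_name).read_text(encoding="utf-8").splitlines():
        ws.append(line.split("\t"))
    # Simple formatting: bold header row and a wider first column.
    for cell in ws[1]:
        cell.font = Font(bold=True)
    ws.column_dimensions["A"].width = 30
    wb.save(workdir / xlsx_name)
```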
FilterTSV2XLSX
- Input: The filtered .tsv file
- Output: A nicely formatted and named Excel sheet
ProductsTSV2XLSX
- Input: The products .tsv file
- Output: A nicely formatted and named Excel sheet