🔗 Build a Link Extractor & Broken Link Checker (Python + PySide6)
🇺🇸 United States • April 19, 2026

Originally published by Dev.to

In this tutorial, we’ll build a desktop app that:

✅ Extracts links from files (.txt, .pdf, .html)
✅ Filters links (include/exclude keywords)
✅ Checks if links are broken
✅ Displays results with colors (🟢 working / 🔴 broken)
✅ Uses a modern GUI with PySide6

📦 Step 1: Install Dependencies

First, install required packages:

pip install PySide6 requests PyPDF2

🧠 Step 2: Import Required Libraries

We start by importing everything we need:

import os
import sys
import re
import requests
import time
import platform
import subprocess

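# Explicit imports instead of a wildcard; extend this list if you add more widgets
from PySide6.QtWidgets import (
    QApplication, QCheckBox, QFileDialog, QLineEdit, QListWidget,
    QListWidgetItem, QProgressBar, QPushButton, QWidget,
)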
from PySide6.QtCore import Qt, QThread, Signal, QTimer
from PySide6.QtGui import QColor, QIcon, QGuiApplication

import PyPDF2

💡 Explanation:
os, re → file handling + regex
requests → check links
PySide6 → GUI framework
PyPDF2 → extract text from PDFs

🧵 Step 3: Create a Background Worker (QThread)

We use a thread so the UI doesn’t freeze while scanning.

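class LinkWorker(QThread):
    found = Signal(str, bool)   # (url, is_broken) for each link discovered
    progress = Signal(int)      # scan progress as a percentage (0-100)
    finished = Signal()         # note: shadows QThread's built-in finished signal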

💡 Why?

GUI apps must stay responsive, so heavy work runs in a thread.

๐Ÿ” Step 3.1: Initialize Worker

def __init__(self, folder, file_types, check_broken, include_words=None, exclude_words=None):
    super().__init__()
    self.folder = folder
    self.file_types = file_types
    self.check_broken = check_broken
    self.include_words = include_words or []
    self.exclude_words = exclude_words or []
    self.seen_links = set()
    self._running = True

💡 Features:
Avoid duplicate links
Support include/exclude filters
Allow stopping process
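
As a rough illustration of how `seen_links` and `_running` might be used inside the scanning loop, here is a Qt-free sketch. The `LinkCollector` class and its `stop()` and `collect()` methods are hypothetical names, not part of the original app; only the two attributes come from the `__init__` above.

```python
class LinkCollector:
    """Hypothetical, Qt-free stand-in for the worker's bookkeeping."""

    def __init__(self):
        self.seen_links = set()   # URLs already reported
        self._running = True      # cleared by stop() to end the loop early

    def stop(self):
        # Ask the loop in collect() to exit at its next iteration.
        self._running = False

    def collect(self, urls):
        """Return the URLs not seen before, in order, until stopped."""
        fresh = []
        for url in urls:
            if not self._running:
                break
            if url in self.seen_links:
                continue
            self.seen_links.add(url)
            fresh.append(url)
        return fresh
```

In the real worker, a Stop button would presumably call a similar `stop()` method, and `run()` would check `self._running` between files and between links.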

📂 Step 3.2: Scan Files

def run(self):
    all_files = []

    for root, _, files in os.walk(self.folder):
        for f in files:
            ext = os.path.splitext(f)[1].lower()

            if (ext == '.txt' and self.file_types['txt']) or \
               (ext == '.pdf' and self.file_types['pdf']) or \
               (ext in ['.html', '.htm'] and self.file_types['html']):
                all_files.append(os.path.join(root, f))

💡 What happens:
Recursively scans folders
Filters only selected file types
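
The same scan can be written as a standalone function, which makes it easy to try outside the GUI. This is a sketch under assumptions: `collect_files` and the `EXTENSIONS` table are illustrative names, and `file_types` is the same dict of booleans the worker receives.

```python
import os

# Map each checkbox key used in the tutorial to the extensions it covers.
EXTENSIONS = {
    "txt": {".txt"},
    "pdf": {".pdf"},
    "html": {".html", ".htm"},
}

def collect_files(folder, file_types):
    """Return paths under `folder` whose extension belongs to an enabled type."""
    wanted = set()
    for key, exts in EXTENSIONS.items():
        if file_types.get(key):
            wanted |= exts

    matches = []
    for root, _, files in os.walk(folder):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            if ext in wanted:
                matches.append(os.path.join(root, name))
    return matches
```

Building the set of wanted extensions once keeps the inner loop to a single membership test, and `file_types.get(key)` tolerates a dict that is missing a key.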

🔗 Step 3.3: Extract Links

urls = re.findall(r'https?://[^\s"\'>]+', text)

💡 Regex explained:
Matches http:// or https://
Stops at spaces or quotes
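
To see the pattern in action, here is a small sketch. The `extract_urls` helper and the punctuation trimming are additions for illustration: the pattern stops at whitespace, quotes, and `>`, so a URL followed by a comma or period in running text would otherwise keep that character.

```python
import re

URL_PATTERN = re.compile(r'https?://[^\s"\'>]+')

def extract_urls(text):
    """Find http(s) URLs and trim punctuation that commonly trails them in prose."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Assumption: a trailing '.', ',', ')' etc. belongs to the sentence, not the URL.
        urls.append(match.rstrip('.,;:!?)'))
    return urls

sample = 'See https://example.com/docs, or <a href="http://example.org/a?b=1">here</a>.'
print(extract_urls(sample))
# ['https://example.com/docs', 'http://example.org/a?b=1']
```

The trimming is a heuristic: a URL that legitimately ends in `)` would be shortened, so treat it as optional.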

📄 Handle PDF Files

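reader = PyPDF2.PdfReader(f)
text = ""
for page in reader.pages:
    # Accumulate every page; extract_text() may yield nothing for image-only pages
    text += (page.extract_text() or "") + "\n"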
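
One way to feed the regex regardless of file type is a single helper that returns a file's text. This is a sketch, not the original app's code: `read_text` is a hypothetical name, and the UTF-8 assumption for .txt and .html files is mine.

```python
import os

def read_text(path):
    """Return the text content of a .txt, .html/.htm, or .pdf file."""
    ext = os.path.splitext(path)[1].lower()

    if ext == ".pdf":
        # Imported lazily so plain-text scans work without PyPDF2 installed.
        import PyPDF2
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            # extract_text() may yield nothing for image-only pages.
            return "\n".join((page.extract_text() or "") for page in reader.pages)

    # Assumption: text and HTML files are UTF-8; undecodable bytes are dropped.
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()
```

PDFs are opened in binary mode and read while the file is still open; everything else is treated as plain text, which is enough for the URL regex to work on HTML source.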

🎯 Step 3.4: Apply Filters

if self.include_words and not any(w in url for w in self.include_words):
    continue

if self.exclude_words and any(w in url for w in self.exclude_words):
    continue

💡 Example:
Include: google
Exclude: facebook
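
The include/exclude example can be checked with a small standalone function. `passes_filters` is an illustrative name; the two conditions are the same ones used in the worker above.

```python
def passes_filters(url, include_words=(), exclude_words=()):
    """Return True if `url` should be kept under the include/exclude rules."""
    # With include words set, at least one must appear in the URL.
    if include_words and not any(w in url for w in include_words):
        return False
    # Any exclude word appearing in the URL rejects it.
    if exclude_words and any(w in url for w in exclude_words):
        return False
    return True

urls = [
    "https://www.google.com/search",
    "https://www.facebook.com/google",
    "https://example.com",
]
kept = [u for u in urls if passes_filters(u, ["google"], ["facebook"])]
print(kept)  # ['https://www.google.com/search']
```

Matching is a case-sensitive substring test, so `Google` would not match `google`; lowercasing both sides first is one option if that matters.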
๐ŸŒ Step 3.5: Check Broken Links

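def check_link(self, url):
    """Return True if the link is broken."""
    try:
        res = requests.get(url, timeout=10)
        return not (200 <= res.status_code < 400)
    except requests.RequestException:
        # Connection errors, timeouts, and malformed URLs count as broken
        return True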

💡 Logic:
200–399 → OK
400+ → broken
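
Pulling the status rule into its own function makes it testable without any network access. `is_broken_status` is a hypothetical helper name; the range check is the one from `check_link`.

```python
def is_broken_status(status_code):
    """Apply the tutorial's rule: anything outside 200-399 counts as broken."""
    return not (200 <= status_code < 400)

for code in (200, 301, 404, 500):
    print(code, "broken" if is_broken_status(code) else "ok")
```

Two caveats worth keeping in mind: `requests.get` follows redirects by default, so the final status is usually not a 3xx, and some servers answer scripts with 403 or 429 even when the page works in a browser, so "broken" here really means "did not return a success status to this client".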
๐Ÿ–ฅ๏ธ Step 4: Build the GUI

Create the main window:

class LinkApp(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("LinkGuardian")
        self.setMinimumSize(1000, 600)

๐Ÿ“ Step 4.1: Folder Selection

self.path_input = QLineEdit()
self.path_input.setReadOnly(True)

browse_btn = QPushButton("Browse")
browse_btn.clicked.connect(self.browse_folder)

def browse_folder(self):
    folder = QFileDialog.getExistingDirectory(self)
    if folder:
        self.path_input.setText(folder)
        self.folder = folder

โš™๏ธ Step 4.2: Options (Checkboxes)

self.txt_checkbox = QCheckBox(".txt")
self.pdf_checkbox = QCheckBox(".pdf")
self.html_checkbox = QCheckBox(".html")

self.check_broken_checkbox = QCheckBox("Check Broken Links")

๐Ÿ” Step 4.3: Filters

self.include_input = QLineEdit()
self.include_input.setPlaceholderText("Include words")

self.exclude_input = QLineEdit()
self.exclude_input.setPlaceholderText("Exclude words")

โ–ถ๏ธ Step 4.4: Start Scan

def start_scan(self):
    self.worker = LinkWorker(
        self.folder,
        {
            'txt': self.txt_checkbox.isChecked(),
            'pdf': self.pdf_checkbox.isChecked(),
            'html': self.html_checkbox.isChecked()
        },
        self.check_broken_checkbox.isChecked(),
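        # Drop empty entries: "".split(",") returns [""], and "" is a substring of every URL
        [w.strip() for w in self.include_input.text().split(",") if w.strip()],
        [w.strip() for w in self.exclude_input.text().split(",") if w.strip()]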
    )

    self.worker.found.connect(self.add_link)
    self.worker.start()

🎨 Step 5: Display Results

def add_link(self, link, is_broken):
    item = QListWidgetItem(link)

    color = QColor("red") if is_broken else QColor("green")
    item.setForeground(color)

    self.results_list.addItem(item)

💡 Result:
🟢 Green → Working link
🔴 Red → Broken link

📊 Step 6: Progress Bar

self.progress_bar = QProgressBar()
self.progress_bar.setMaximum(100)

Update it from the worker:

self.worker.progress.connect(self.progress_bar.setValue)
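
The tutorial does not show how the worker computes the value it emits. One plausible approach, sketched here with a hypothetical `progress_percent` helper, is to turn "files processed so far" into a 0-100 integer that matches the bar's maximum.

```python
def progress_percent(done, total):
    """Convert 'done out of total' into an integer from 0 to 100."""
    if total <= 0:
        return 100  # nothing to scan, so treat the job as complete
    return min(100, done * 100 // total)
```

Inside `run()`, this could be used as `self.progress.emit(progress_percent(i + 1, len(all_files)))` while looping over the files, assuming `i` is the zero-based loop index.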

📋 Step 7: Copy All Links

def copy_all_links(self):
    links = "\n".join(
        self.results_list.item(i).text()
        for i in range(self.results_list.count())
    )

    QGuiApplication.clipboard().setText(links)

๐ŸŒ Step 8: Open Links on Double Click

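def open_item(self, item):
    url = item.text()

    system = platform.system()
    if system == "Windows":
        os.startfile(url)
    elif system == "Darwin":
        # macOS has no xdg-open; its equivalent is "open"
        subprocess.Popen(["open", url])
    else:
        subprocess.Popen(["xdg-open", url])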

🚀 Step 9: Run the App

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = LinkApp()
    window.show()
    sys.exit(app.exec())

🎉 Final Result

You now have a professional desktop tool that:

✔ Extracts links from files
✔ Filters intelligently
✔ Detects broken links
✔ Displays results beautifully
✔ Runs smoothly with threads

💡 Bonus Ideas

Want to upgrade it further?

Export results to CSV
Add domain grouping
Add link preview
Add multi-threaded link checking (faster 🚀)
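
For the multi-threaded idea, the standard library's `concurrent.futures` is one possible starting point. The sketch below is an assumption about how it could be wired up, not part of the original app: `check_links_concurrently` is a hypothetical helper, and `check_link` is any function that takes a URL and returns True when the link is broken.

```python
from concurrent.futures import ThreadPoolExecutor

def check_links_concurrently(urls, check_link, max_workers=8):
    """Run `check_link(url) -> bool` for every URL in a thread pool.

    Returns a dict mapping each URL to True (broken) or False (working).
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so zip pairs each URL with its result.
        results = list(pool.map(check_link, urls))
    return dict(zip(urls, results))

# Stand-in checker so the sketch runs without network access.
def fake_check(url):
    return "missing" in url

print(check_links_concurrently(
    ["https://example.com/ok", "https://example.com/missing"], fake_check
))
```

Link checks spend most of their time waiting on the network, which is why threads help here. In the app, this would still run inside the background worker so the GUI thread stays free.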
