elasticsearch

Differences

This view shows the differences between two versions of the page.

elasticsearch [2025/06/30 20:06]
jango [Query Data]
elasticsearch [2025/06/30 20:38] (current)
jango [Python]
Line 229: Line 229:
  
 =====Python===== =====Python=====
 +
 +<code python>
 +import logging
 +from urllib.parse import urljoin
 +import requests
 +from bs4 import BeautifulSoup
 +
 +logging.basicConfig(
 +    format='%(asctime)s %(levelname)s:%(message)s',
 +    level=logging.INFO)
 +
 +class Crawler:
 +
 +    def __init__(self, urls=None):
 +        self.visited_urls = []
 +        # Copy the input: a mutable default argument would be shared between instances
 +        self.urls_to_visit = list(urls or [])
 +
 +    def download_url(self, url):
 +        # A timeout keeps one unresponsive host from stalling the whole crawl
 +        return requests.get(url, timeout=10).text
 +
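 +    def get_linked_urls(self, url, html):
 +        soup = BeautifulSoup(html, 'html.parser')
 +        for link in soup.find_all('a'):
 +            href = link.get('href')
 +            # Skip anchors without an href; otherwise None ends up in the queue
 +            if not href:
 +                continue
 +            # urljoin resolves relative links and leaves absolute URLs unchanged
 +            absolute = urljoin(url, href)
 +            # Drop mailto:, javascript: and similar links that cannot be downloaded
 +            if absolute.startswith(('http://', 'https://')):
 +                yield absolute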
 +
 +    def add_url_to_visit(self, url):
 +        if url not in self.visited_urls and url not in self.urls_to_visit:
 +            self.urls_to_visit.append(url)
 +
 +    def crawl(self, url):
 +        html = self.download_url(url)
 +        # Use a separate name so the loop does not shadow the page URL
 +        for linked_url in self.get_linked_urls(url, html):
 +            self.add_url_to_visit(linked_url)
 +
 +    def run(self):
 +        while self.urls_to_visit:
 +            url = self.urls_to_visit.pop(0)
 +            logging.info(f'Crawling: {url}')
 +            try:
 +                self.crawl(url)
 +            except Exception:
 +                logging.exception(f'Failed to crawl: {url}')
 +            finally:
 +                self.visited_urls.append(url)
 +
 +if __name__ == '__main__':
 +    Crawler(urls=['https://www.imdb.com/']).run()
 +</code>
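
The crawler above follows every link it finds and never stops on its own, so it is hard to try out safely. The following is a minimal sketch of the same breadth-first idea that runs offline, assuming a few things that are not in the page: the names `LinkParser`, `extract_links`, `crawl`, `fake_fetch` and `PAGES` are hypothetical, the standard-library `html.parser` stands in for BeautifulSoup, the crawl is restricted to the start URL's host, and a `max_pages` limit is added. The download step is passed in as a function so that `requests` can be swapped for an in-memory stub.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # ignore anchors without an href
                self.hrefs.append(href)


def extract_links(base_url, html):
    """Return all links in html as absolute URLs."""
    parser = LinkParser()
    parser.feed(html)
    # urljoin resolves relative paths and leaves absolute URLs unchanged
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl of start_url's host; fetch(url) must return HTML."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])   # O(1) pops from the left, unlike list.pop(0)
    seen = {start_url}           # O(1) membership test, unlike a list
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # a failed download should not end the crawl
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Assumed in-memory "site" so the sketch runs without network access
PAGES = {
    'https://example.test/': '<a href="/a">A</a> <a href="https://other.test/">X</a>',
    'https://example.test/a': '<a href="/">home</a> <a href="/b">B</a> <a>no href</a>',
    'https://example.test/b': '',
}


def fake_fetch(url):
    return PAGES[url]


print(crawl('https://example.test/', fake_fetch))
```

To crawl a real site, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`. Keeping the host filter and `max_pages` in place is probably wise for a first run.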
 +
  
 ====Create Index==== ====Create Index====
elasticsearch.1751306808.txt.gz · Last modified: 2025/06/30 20:06 by jango