<code>
sudo curl -X GET "https://localhost:9200/_cat/indices" --insecure -u elastic:dein_sicheres_passwort
</code>
=====Security=====

[[https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/security-settings|Elasticsearch Security Settings]]
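
A minimal sketch of the corresponding ''elasticsearch.yml'' settings, assuming a single node with the keystores that a recent auto-configured installation generates; the paths are assumptions and must match your own setup (see the reference above):

<code yaml>
# Enable authentication and TLS on the HTTP and transport layers.
# The keystore paths below are assumptions from a default install.
xpack.security.enabled: true

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12

xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/transport.p12
xpack.security.transport.ssl.truststore.path: certs/transport.p12
</code>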
=====Service=====
  
  
=====Python=====

<code python>
import logging
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    format='%(asctime)s %(levelname)s:%(message)s',
    level=logging.INFO)

class Crawler:

    def __init__(self, urls=None):
        # Avoid the mutable default argument; copy the seed list.
        self.visited_urls = []
        self.urls_to_visit = list(urls or [])

    def download_url(self, url):
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text

    def get_linked_urls(self, url, html):
        # Yield absolute URLs for every relative link on the page.
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            path = link.get('href')
            if path and path.startswith('/'):
                path = urljoin(url, path)
            yield path

    def add_url_to_visit(self, url):
        # Skip None (links without href, or absolute/external paths)
        # and URLs that are already queued or visited.
        if url and url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def crawl(self, url):
        html = self.download_url(url)
        for linked_url in self.get_linked_urls(url, html):
            self.add_url_to_visit(linked_url)

    def run(self):
        # Breadth-first crawl: always take the oldest queued URL first.
        while self.urls_to_visit:
            url = self.urls_to_visit.pop(0)
            logging.info(f'Crawling: {url}')
            try:
                self.crawl(url)
            except Exception:
                logging.exception(f'Failed to crawl: {url}')
            finally:
                self.visited_urls.append(url)

if __name__ == '__main__':
    Crawler(urls=['https://www.imdb.com/']).run()
</code>
 +
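To push crawled pages into Elasticsearch, the official ''elasticsearch'' Python client can be used. The sketch below is only an illustration: the index name ''webpages'' and the helper ''index_page()'' are hypothetical, and ''verify_certs=False'' mirrors the ''--insecure'' flag from the curl examples above.

<code python>
from elasticsearch import Elasticsearch

# Assumptions: local node, the 'elastic' user from the examples above,
# and a hypothetical index name 'webpages'.
es = Elasticsearch(
    'https://localhost:9200',
    basic_auth=('elastic', 'dein_sicheres_passwort'),
    verify_certs=False)

def index_page(url, html):
    # One document per crawled page; Elasticsearch generates the _id.
    es.index(index='webpages', document={'url': url, 'html': html})
</code>

A call to ''index_page(url, html)'' could be added to ''Crawler.crawl()'' right after the page is downloaded, so that every visited page ends up in the index.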
  
====Create Index====