ht://Check

A link checker tool

ht://Check Ranking & Summary


  • Rating:
  • License: GPL
  • Price: FREE
  • Publisher Name: Gabriele Bartolini
  • Publisher web site: http://www.htminer.it/home.it.htm

ht://Check Description

ht://Check is more than a link checker: it is a console application for GNU/Linux systems, written in C++ and derived from ht://Dig, the well-known free search engine. It is useful for webmasters who want to monitor their websites, both to discover unexpected broken links and to mine interesting information from their hypertext documents. ht://Check also builds a complete and very flexible data source in MySQL, and since it is an open-source project it is easy to improve. For more information, see the General Info and Features sections.

General info

ht://Check retrieves documents over HTTP/1.1 and stores information about them in a MySQL database; it is particularly suitable for small Internet domains or intranets. Its purpose is to help a webmaster manage one or more related sites: after a "crawl", ht://Check creates a powerful data source built from the retrieved documents. The information available to the user includes:

  • attributes of single documents, such as content type, size, and last modification time;
  • information about the retrieval process of a resource, for instance whether it was successfully retrieved or not, together with the HTTP status codes returned (ht://Check uses this protocol for crawling the Web);
  • information about the structure of a document, essentially its HTML link tags and the relationships they establish: ht://Check crawls a Web domain, or set (in the algebraic sense), within which links create inter-document relationships.
These relationships let the user extract further information from the domain:

  • link results: whether a link is working, broken, or redirected; the current version also detects whether a link is actually a non-working anchor, a JavaScript link, or an e-mail address;
  • the relationships between documents, in terms of incoming and outgoing links; future development will pay particular attention to Web structure mining.

The htcheck program itself prints only a brief report; most of the information is provided by the PHP interface that comes with the package, which queries the database built by htcheck during a previous crawl. It goes without saying that you need a Web server to use it, along with PHP and its MySQL connectivity module.

Since a crawl produces a database on a MySQL server, any user could in principle build their own information-retrieval interface on top of it; you only need to know its structure: the tables, the fields, and the relationships among them. Other options are independent scripts written in scripting languages with MySQL connectivity modules (e.g. Perl or Python), faster programs written in C or C++ using the MySQL API or wrapper libraries (such as MySQL++ or dbconnect), or other Web-driven solutions such as JSP or ColdFusion.
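To give an idea of how small such a custom interface can be, here is a minimal sketch of a broken-link query against the crawl database. ht://Check stores its data in MySQL; SQLite (from the Python standard library) stands in here only so the example runs without a server, and the table and column names (url, link, status, and so on) are hypothetical, so inspect the schema that htcheck actually creates before adapting this.

```python
# Sketch of querying a crawl database for broken links.
# NOTE: SQLite is used as a stand-in for MySQL, and the schema
# (tables url and link) is hypothetical -- check the real schema
# created by htcheck before reusing these queries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url  (url TEXT PRIMARY KEY, status INTEGER);
    CREATE TABLE link (source TEXT, dest TEXT);
""")
conn.executemany("INSERT INTO url VALUES (?, ?)", [
    ("http://example.org/",       200),
    ("http://example.org/a.html", 200),
    ("http://example.org/gone",   404),
])
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("http://example.org/", "http://example.org/a.html"),
    ("http://example.org/", "http://example.org/gone"),
])

# Every link whose destination came back with an HTTP error code.
broken = conn.execute("""
    SELECT l.source, l.dest, u.status
    FROM link AS l JOIN url AS u ON l.dest = u.url
    WHERE u.status >= 400
""").fetchall()

for source, dest, status in broken:
    print(f"{source} -> {dest} ({status})")
```

Against a real ht://Check database, the same query would run unchanged through any MySQL connectivity module; only the connection setup and the actual table names differ.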
An interface to ht://Check for the Roxen Web server has been written by Michael Stenitzer (stenitzer@eva.ac.at).

Here are some key features of ht://Check:

The "Spider" or "Crawler"

  • HTTP/1.1 compliant, with persistent connections and cookie support
  • HTTP Basic authentication supported
  • HTTP proxy support (basic authentication included)
  • Crawls customisable through many configuration attributes, which let the user limit the digging by URL pattern matching and by distance ("hops") from the first URL
  • MySQL databases created directly by the spider
  • MySQL connections configured through the user or general option files defined by the database system (/etc/my.cnf or ~/.my.cnf)
  • No support for JavaScript or for other protocols such as HTTPS, FTP, NNTP, and local files

The "Analyser"

  • Since all the data gathered during a crawl are stored in a MySQL database, it is easy to obtain the desired information by querying that database. The spider is part of the htcheck application, which prints a small text report at the end of a run; afterwards you can always retrieve information from the database by building your own interface (in PHP or Perl, for instance) or by using the default one written in PHP.
  • ht://Check builds a data source that can be used for Web structure mining, revealing knowledge about the relationships within and between documents. Web usage mining tools can also find interesting information in it and use it as an auxiliary data source, for example to build a site map.
  • htcheck (the console application) gives a summary of broken links, broken anchors, servers seen, and content types encountered.
The PHP interface lets you perform:

  • queries on URLs, using discriminating criteria such as pattern matching, status code, content type, and size;
  • queries on links, with pattern matching (including regular expressions) on both the source and destination URLs, on their results (broken, OK, anchor not found, redirected), and on their type (normal, direct, redirected);
  • lookups of a specific URL (outgoing and incoming links, last-modified date and time, etc.);
  • lookups of a specific link (broken or OK) and of the HTML instruction that issued it;
  • statistics on the documents retrieved.

Requirements:

  • GNU C/C++ compiler and libstdc++
  • MySQL 4.x, 3.23.xx, or 3.22.xx
  • PHP 4.x, if you want to use the interface (it should work with PHP 3 too, but the author can no longer test it)
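The crawl mentioned above is driven by configuration attributes in the "attribute: value" file format of the ht://Dig family from which ht://Check is derived. A minimal sketch might look like the following; the attribute names shown are the ht://Dig ones and are an assumption here, so check them against the ht://Check documentation before use:

```
# Hypothetical htcheck configuration -- attribute names follow
# ht://Dig conventions and may differ in ht://Check itself.
start_url:      http://www.example.org/
limit_urls_to:  http://www.example.org/
max_hop_count:  5
```

MySQL credentials are not placed in this file: as noted above, the spider reads them from the database system's own option files (/etc/my.cnf or ~/.my.cnf).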


ht://Check Related Software