Make a documents database



Introduction

Many people organize their files in a directory structure. Suppose you have many documents and you want to be able to find them again easily and quickly. You need a search engine that can find a document based on its keywords, title, author name, document type, and so on.

The purpose of this section is to explain, step by step, a practical way to find any document stored somewhere in a directory structure very quickly. Enter your search criteria in a web form and you get a list of matching documents. Click on one of the documents in the list and it opens.

This solution is multi-platform, free and flexible. It's just an assembly of well-known software with small scripts around it to glue the whole thing together. The basic ingredients of this recipe are :

  • Tcl/Tk scripting language
  • MySQL database
  • Apache web server
  • PHP scripting language
  • Your favorite web-browser

If many users need to access your database, you only have to install the software once, on a central computer. Each user only needs a web browser to access the database from anywhere on the network. To prevent abuse, users must enter a password before they can access the documents.

The story in an example

Suppose you have a directory 'Documents' containing the following files :
Documents
    phones.txt
    Project1
        hf38_specs.pdf
        kg76_specs.pdf
        report.doc
    Project2
        hf_meas.xls
        letter.doc

For each file you want to see in your database, add a description file with the extension '.nfo', as follows :
Documents
    phones.nfo
    phones.txt
    Project1.nfo
    Project1
        hf38_specs.nfo
        hf38_specs.pdf
        kg76_specs.nfo
        kg76_specs.pdf
        report.nfo
        report.doc
    Project2.nfo
    Project2
        hf_meas.nfo
        hf_meas.xls
        letter.nfo
        letter.doc

The description files contain information about the corresponding document, written as a list of fields. Only the first four letters of a field name are significant. The possible fields are :

  • titl : the title of the document
  • keyw : keywords
  • auth : the name of the author(s)
  • crea : the creation date
  • proj : the name of the project
  • type : the kind of document (for example : calculation, measurement, document, script ...)
  • refe : a reference or a document number

All fields are optional, but you should at least fill in the field 'titl'.

In the previous example, we could have :

phones.nfo

title    : phone book
keywords : phone
author   : Bruno Champagne
type     : document
project  : none

Project2.nfo

title    : Project no 2
project  : project2

hf_meas.nfo

title    : HF measurements
keywords : HF jitter
type     : measurements
author   : Antonio Soubri

letter.nfo

title    : research contract
keywords : KGF contract
type     : document
author   : Mona Moors

In place of 'author', you may just write 'auth'.
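
To make the four-letter rule concrete, here is a minimal sketch in Python of how such a file could be read. It is an illustration only, not the Tcl code used by the scripts described here. The function name 'parse_nfo' is hypothetical, and treating field names as case-insensitive is an assumption.

```python
def parse_nfo(text):
    """Parse the contents of a '.nfo' file into a dict keyed by 4-letter field names."""
    known = {"titl", "keyw", "auth", "crea", "proj", "type", "refe"}
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: not a "field : value" line
        # Only the first four letters of the field name are significant.
        # Lower-casing the name is an assumption made for this sketch.
        key = name.strip().lower()[:4]
        if key in known:
            fields[key] = value.strip()
    return fields


sample = """title    : HF measurements
keywords : HF jitter
type     : measurements
auth     : Antonio Soubri
"""
print(parse_nfo(sample))
```

With this approach 'author', 'auth' and 'authors' all end up under the same key, and lines with an unknown field name are simply ignored.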

Don't forget to create a '.nfo' file for each directory too (even an empty file is enough), otherwise its sub-directories won't be scanned.
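
The scanning rule can be sketched as follows in Python. This is only an illustration under assumptions, not the actual 'makeindex.tcl' script: the function name 'scan' is hypothetical, and the sketch assumes that the '.nfo' file of a directory sits next to the directory, as in the listing above.

```python
import os
import tempfile


def scan(directory):
    """Yield (document_path, nfo_path) pairs found below 'directory'.

    A file or sub-directory without a matching '.nfo' file is skipped,
    and a skipped sub-directory is not entered at all.
    """
    names = set(os.listdir(directory))
    for entry in sorted(names):
        if entry.endswith(".nfo"):
            continue
        path = os.path.join(directory, entry)
        is_dir = os.path.isdir(path)
        # 'report.doc' is described by 'report.nfo', 'Project1' by 'Project1.nfo'.
        stem = entry if is_dir else os.path.splitext(entry)[0]
        nfo = stem + ".nfo"
        if nfo not in names:
            continue
        yield path, os.path.join(directory, nfo)
        if is_dir:
            yield from scan(path)


# Small demonstration tree: 'Project1' has no '.nfo' file, so it is not scanned.
with tempfile.TemporaryDirectory() as root:
    for rel in ("phones.txt", "phones.nfo",
                "Project1/report.doc", "Project1/report.nfo",
                "Project2.nfo", "Project2/letter.doc", "Project2/letter.nfo"):
        target = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "w").close()
    found = [os.path.relpath(doc, root).replace(os.sep, "/") for doc, _ in scan(root)]

print(found)
```

In this demonstration 'Project1/report.doc' is not found even though it has its own '.nfo' file, because the directory 'Project1' itself has none.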

A Tcl script will scan the directories and try to find any '.nfo' file and the corresponding document. The script will fill the database (here an SQL database). In the table below, you see what the database entries will look like (only a few columns are shown) :

title              keywords      type          project
phone book         phone         document      none
Project no 2       -             -             project2
HF measurements    HF jitter     measurements  project2
research contract  KGF contract  document      project2
...                ...           ...           ...

This table also contains the name of the author, a reference and the location of the file.

Actually, it is slightly more complicated. We assume there is a limited number of types and projects, so the real database contains 3 tables :

  • the documents table is the same as described above, except that the columns 'type' and 'project' hold an index instead of a string. The index points to an entry in the type or project table
  • the project table has two columns : the first is an index and the second is a string containing the name of the corresponding project
  • the type table has two columns : the first is an index and the second is a string containing the name of the corresponding type

But those are details you don't need to care about ...
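
For the curious, the three-table layout can be sketched like this. The sketch uses Python with an in-memory SQLite database as a stand-in for MySQL, so that it runs without a server. The column names and the helper 'lookup_id' are assumptions made for illustration; the tables actually created by 'initdb.tcl' may differ.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the MySQL database
db.executescript("""
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE type    (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE documents (
    id         INTEGER PRIMARY KEY,
    title      TEXT,
    keywords   TEXT,
    type_id    INTEGER REFERENCES type(id),
    project_id INTEGER REFERENCES project(id)
);
""")


def lookup_id(table, name):
    """Return the index of 'name' in a lookup table, adding the name if needed."""
    if table not in ("project", "type"):
        raise ValueError("unknown lookup table")  # table names cannot be bound as parameters
    db.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    return db.execute(f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()[0]


for title, keywords, doc_type, project in [
    ("HF measurements", "HF jitter", "measurements", "project2"),
    ("research contract", "KGF contract", "document", "project2"),
]:
    db.execute(
        "INSERT INTO documents (title, keywords, type_id, project_id) VALUES (?, ?, ?, ?)",
        (title, keywords, lookup_id("type", doc_type), lookup_id("project", project)),
    )

# Joining the three tables gives back the readable names.
rows = db.execute("""
    SELECT d.title, t.name, p.name
    FROM documents d
    JOIN type t    ON t.id = d.type_id
    JOIN project p ON p.id = d.project_id
    ORDER BY d.id
""").fetchall()
print(rows)
```

Both documents share a single row of the project table, which is also what lets the search form offer a fixed list of projects and types.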

Field values can be inherited from a parent directory :

  • type may inherit from a parent directory if not specified for the current file
  • project may inherit from a parent directory if not specified for the current file
  • the keywords of the parent directories are added to the keywords of the current file
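
These three rules can be sketched in a few lines of Python. This is an illustration, not the code of the Tcl script. The function name 'inherit' is hypothetical, fields are keyed by their four-letter names, and the sketch assumes that keywords are separated by spaces and that the parent's keywords come first.

```python
def inherit(parent, own):
    """Combine the fields of a parent directory with the fields of a file."""
    merged = dict(own)
    # 'type' and 'project' are taken from the parent only when the file has none.
    for key in ("type", "proj"):
        if not merged.get(key) and parent.get(key):
            merged[key] = parent[key]
    # Keywords are cumulative: parent keywords are added to the file's own.
    words = parent.get("keyw", "").split() + own.get("keyw", "").split()
    if words:
        merged["keyw"] = " ".join(words)
    return merged


project2 = {"titl": "Project no 2", "proj": "project2"}
letter = {"titl": "research contract", "keyw": "KGF contract", "type": "document"}
print(inherit(project2, letter))
```

This is why 'letter.doc' appears with the project 'project2' in the table above although 'letter.nfo' does not name a project. For deeper trees, the result for a directory would be passed on as the parent of its own entries.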

We can connect to the search engine with a simple web browser. If Apache is configured so that our files are in a restricted area, the user is first prompted for a login and password.

Then we get a search form such as the following :

Search document

  • project
  • title containing
  • keywords
  • reference
  • document type
  • dirname contains
  • filename contains
  • authors contains

As you can see in the form above, the possible values for 'project' and 'type' are automatically filled from the scanned files.

After you've entered your search criteria, you get a list of matching documents :

Search results

title/keywords     author(s)        type/file   project
phone book         Bruno Champagne  document    none
phone                               phones.txt
research contract  Mona Moors       document    project2
KGF contract                        letter.doc

The title of the document is also a hyperlink to the document. If your browser has the appropriate plug-ins, one click on the title is enough to open the document.
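
Behind such a form, the search boils down to building one SQL query from the fields the user filled in. The following Python sketch shows one way to do it. It is not the code of the PHP scripts: the function name 'build_query' and the column names are assumptions, and the '?' placeholder style is the one used by SQLite drivers (MySQL drivers usually expect '%s').

```python
def build_query(criteria):
    """Return (sql, params) for the search fields that are not empty."""
    columns = ("title", "keywords", "reference", "authors")  # assumed column names
    clauses, params = [], []
    for column in columns:
        value = criteria.get(column, "").strip()
        if value:
            # "contains" search: match the text anywhere in the column.
            clauses.append(f"{column} LIKE ?")
            params.append(f"%{value}%")
    sql = "SELECT * FROM documents"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


print(build_query({"title": "contract", "keywords": ""}))
```

The user's text is passed as a parameter and never pasted into the SQL string, and an empty form produces a query without a WHERE clause, which lists every document.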

Apache/PHP setup

Download Apache at http://httpd.apache.org. Install it. Make a directory where you will put your documents. For example, make 'C:/html'.

Download PHP at http://www.php.net/. Install it.

In the configuration file of Apache, httpd.conf, change the setting 'DocumentRoot' to point to the directory containing your html files. For example,
DocumentRoot "C:/html"

You also need to specify the directories to be served and the corresponding 'alias' names. For example, if you want to be able to access the directories 'C:/documents_project1' and 'C:/personnal_docs', add the following lines to the file httpd.conf:
alias "/project1" "C:/documents_project1"
alias "/family" "C:/personnal_docs"

In the same file, check the section included between '<Directory />' and '</Directory>'. It should look like this :

<Directory />
  Options FollowSymlinks
  AllowOverride None
  AuthName "restricted area"
  AuthType Basic
  AuthUserFile "c:/pass.txt"
  require valid-user
</Directory>

In the example above, we specify that every user has to log in before accessing the documents. The file containing the user logins and passwords is (in this example) named 'c:/pass.txt'.

You also need to tell Apache where to find the PHP interpreter. For example, if the interpreter is 'C:/PHP/PHP.EXE', then you need to check that the following lines are present in the file httpd.conf :

ScriptAlias /php/ "c:/php/"
AddType application/x-httpd-php .php
AddType application/x-httpd-php-source .phps
Action application/x-httpd-php "/php/php.exe"

Define a new user. Open a console and go to the bin directory of Apache (for Windows users, this directory should be 'C:/Program files/Apache Group/Apache/bin'). Type

htpasswd -c c:/pass.txt username
where 'username' is the name of the user you wish to add. The '-c' option creates the password file, so leave it out when adding further users. To add a second user, type
htpasswd c:/pass.txt username2
where 'username2' is the name of the second user.

Start the Apache server.

Tcl + SQL library

Download Tcl/Tk 8.3 from http://dev.scriptics.com/software/tcltk/download83.html. Install it.

Now you need a library that allows Tcl to access MySQL. Download fbsql at http://www.fastbase.co.nz/fbsql/index.html. Windows users: install the dll file in the bin directory of Tcl. Unix users: follow the instructions ...

Scripts installation

Download the following zip file scripts.zip and install its contents in the root directory of the Apache server. In our example, install the files in the directory 'C:/html'.

You will find the following files :

  • index.html : the first file Apache will open when you connect. It just contains a link to 'search.php'. But it's up to you to make a more attractive site ...
  • search.php : PHP script containing the search form
  • results.php : PHP script that shows the results of the search
  • initdb.tcl : prepares MySQL for the documents database
  • makeindex.tcl : scans the directories and fills the database. Every hour, it will update the database in the background.

Mysql setup

If needed, download MySQL at http://www.mysql.org. Install it.

Start the server :

  • Under Windows, if you don't see a traffic light in the corner of the screen, run 'C:/mysql/bin/winmysqladmin', then right-click on the red light and select 'start server'
  • Under Linux, log in as 'root' and type : safe_mysqld &. Before you start using the database, you need to create the grant tables (which determine who can connect to the database). So type : mysql_install_db

Now we need to prepare MySQL for our documents database. Execute the script 'initdb.tcl'. It does the following :

  • It tries to connect to the MySQL server.
  • If needed, it prompts for a new password for the user 'root'. 'root' is the administrator of the MySQL database (not to be confused with the administrator of a Unix machine!).
  • It creates a new user 'db_user'. The password is 'db_pass'. This user can only connect locally.
  • It creates a new database 'documents_db' and all the needed tables. One of them ('scan_dirs') contains the list of directories to be scanned. The 'initdb.tcl' script inserts one entry in this table : 'documents' (the default name of the directory where you will put your documents).

Restart the MySQL server.

Changing the directories to be scanned

The script 'makeindex.tcl' mentioned above needs to know which directories to scan.

If you only want to scan the directory 'documents' (or more precisely, 'C:/html/documents'), you don't need to change anything (because it is the default setting).

Suppose you want to scan the directories named 'C:/documents_project1' and 'C:/personnal_docs' (see Apache setup). Start the MySQL client in a console :
(Windows users : open an MS-DOS prompt and go to the directory C:/mysql/bin ; Unix users : any console will do)

mysql -udb_user -p
When prompted for password enter 'db_pass'. Then,
use documents_db;

To see the list of the scanned directories, type

select * from scan_dirs;

You will only see the directory 'documents'. To delete this first entry from the list, type

delete from scan_dirs where id=1;

Now, you can insert the new entries.

insert into scan_dirs values(null,
           'C:/documents_project1','project1');
insert into scan_dirs values(null,
           'C:/personnal_docs','family');

As you can see, for each directory you add, you also need to give its alias name (the same as specified in the setup of the Apache server). This is needed so that the search results are linked to the right web address.
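
To see why the alias matters, here is a small Python sketch of how the path of a scanned file could be turned into the web address served by Apache. It is an illustration, not the code of the scripts: the function name 'document_url' and the default host are assumptions.

```python
from urllib.parse import quote


def document_url(path, scan_dir, alias, host="127.0.0.1"):
    """Map the path of a scanned file to its address behind the Apache alias."""
    root = scan_dir.replace("\\", "/").rstrip("/")
    path = path.replace("\\", "/")
    if not path.startswith(root + "/"):
        raise ValueError("file is not inside the scanned directory")
    relative = path[len(root) + 1:]
    # quote() escapes spaces and other special characters but keeps the slashes.
    return f"http://{host}/{alias}/{quote(relative)}"


print(document_url("C:/documents_project1/specs/hf38_specs.pdf",
                   "C:/documents_project1", "project1"))
```

If the alias stored in 'scan_dirs' differed from the one declared in httpd.conf, the generated links would point to an address that Apache does not serve.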

Try it !

At this stage, the database should be fully operational.

First of all, create the directory where you want to place your documents, for example 'C:/documents_project1'. Place a few documents in this directory and make the corresponding '.nfo' files. You may also create sub-directories, but don't forget to create a '.nfo' file for each sub-directory (even an empty '.nfo' file is OK). Start the script 'makeindex.tcl'. It will run in the background and silently update the database every two hours.

Start your favorite web browser. In the address bar, type 'http://127.0.0.1' if you are working without a network, or enter the address of the computer where the server is running if you are working on a network. You should now see the prompt for login and password. Enter the user name and password you have defined as described above. Click on the link 'search document'. You should see the form 'Search document'. Click on 'Submit' and you will see a list of all the indexed documents.

Remark: the use of this database is not limited to documents. You can use it for almost anything. For example, if you want to make a database of your friends, you can make an HTML file for each of them containing any information you want. You can even add a picture, or just a scan of their business card. Then make the corresponding '.nfo' file.

Auto-logon and auto-startup on a Windows computer

This section only applies to Windows users. It should be easy to do the same job on a Unix computer, but I have not tried it yet.

First of all, if you have an NT computer, you can configure the Apache web server and the MySQL database as 'services' so that they are started automatically at startup.

Secondly, to be able to access the network drives, you need to log on. To be sure the same login is used every time, the simplest solution is to use auto-logon. Click on the 'Start' button, then 'Run...', and type 'regedit.exe'. Select the key 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon'. Define the following entries as string values :

  • AutoAdminLogon, value "1"
  • DefaultUserName
  • DefaultDomainName
  • DefaultPassword

Thirdly, to start the script 'makeindex.tcl' automatically after each login, go to 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run' in the registry. Define a string entry, for example "shutdownscript", and give it as value the location of the makeindex script, for example "c:\html\makeindex.tcl".

If you also want the computer to shut down automatically at a fixed time, refer to the chapter 'auto-shutdown'.