Table of Contents

  1. General description
  2. Compiling
  3. Usage
  4. Command line (list of commands)
  5. Binary mode
  6. Tutorial (simple example)
  7. Version history and outlook
  8. Changelog
  9. Add-ons
  10. Compiling algorithms
  11. Creating new algorithms
  12. Developed and tested with
  13. Bugs & Contact
  14. License
  15. Reference: Configuration file formats
  16. Appendix: File lists

1. General description

The parsefile program parses files using different parsing algorithms, specified in library files.

It includes a command line, which is used to manage files, parsing algorithms, configurations and variables.
The program can also be run without any user input by using configuration files and command-line arguments.

It is meant to be small, simple and as extensible as possible through shared libraries and C++ classes.

The program must be run in its program directory, which includes at least the following directories:

algos - for the algorithm files: libraries (*.so), descriptions (*.txt), temporary files (*.tmp)
conf - for configuration files (*.cfg)
output - for the output files (*.out)
tmp - for temporary files

Optional directories include:

src - the source files (*.cpp) and header files (*.h) of the program and a compile.sh for compiling and linking
algos/src - the source files (*.cpp) and header files (*.h) of the provided algorithms and a compile.sh for compiling
algos/src/templates - template files for the programming of new algorithms
input - different example files for parsing

Note: The program has to be run exactly in its program directory, not in one of these sub-directories!

2. Compiling

The compiled and executable binary file parsefile is already included in the package.

If you want to recompile and link the source code of the main program, which is found in the src folder (if included in your package), you can use the g++ compiler with the options -ldl and -rdynamic, which are needed for the dynamic library loading functionality.

Alternatively, you can run src/compile.sh by opening a terminal in the main program folder (not the src folder) and typing sh src/compile.sh.

3. Usage

parsefile
Opens the command console.

parsefile <config>
Loads the configuration <config> and opens the command console.
Enter start to immediately start the parsing process, using this configuration.

parsefile <config> --start
parsefile <config> -s
Loads the configuration <config> and starts the parsing process immediately, using this configuration.
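
The three forms above can be told apart with a few lines of argument handling. The following is only a sketch of how such a check might look, not the actual code of parsefile. The names launch_options and parse_arguments are made up for this illustration.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical result type: what the program arguments asked for.
struct launch_options {
    std::string config;  // empty if no <config> was given
    bool start_now;      // true if --start or -s was given
};

// Interprets "parsefile [<config> [--start|-s]]".
launch_options parse_arguments(int argc, const char* const* argv) {
    launch_options opt{"", false};
    if (argc >= 2) opt.config = argv[1];
    if (argc >= 3 && (std::strcmp(argv[2], "--start") == 0 ||
                      std::strcmp(argv[2], "-s") == 0)) {
        opt.start_now = true;
    }
    return opt;
}
```

With start_now set, a program built like this would skip the command console and begin parsing right away.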

4. Command line (list of commands)

Note: Commands and arguments can be separated by one or more spaces or tabs.
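
As an illustration of this rule, here is a minimal sketch that splits an input line into the command word and the rest of the line. It is an assumption about how such a split could be done, not the code used by parsefile, and the function name split_command is made up. The rest of the line is kept in one piece, so that values containing spaces (like the comment variable in the tutorial) survive.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Splits a line into the command word and the rest of the line.
// Leading separators and the separators between both parts are dropped.
std::pair<std::string, std::string> split_command(const std::string& line) {
    const char* separators = " \t";
    std::string::size_type begin = line.find_first_not_of(separators);
    if (begin == std::string::npos) return {"", ""};  // empty or blank line
    std::string::size_type end = line.find_first_of(separators, begin);
    if (end == std::string::npos) return {line.substr(begin), ""};
    std::string::size_type arg = line.find_first_not_of(separators, end);
    if (arg == std::string::npos) return {line.substr(begin, end - begin), ""};
    return {line.substr(begin, end - begin), line.substr(arg)};
}
```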

Here is the complete list of commands for the current version of parsefile:

help [<command>] Displays the current list of commands, or - if specified - information about <command>.
Identical commands: ?, h, man
version Displays the current version of the main program.
Identical commands: v, ver
addfile <file> Adds <file> to the list of files to parse and displays the file number used by removefile.
Identical commands: f, af, addf, file
showalgos Shows the list of available parsing algorithms.
Identical command: algos
addalgo <name> Adds parsing algorithm <name> to the parser and displays the algorithm number used by removealgo. Note that for most parsing algorithms the order of the algorithms added is important. The algorithm added first will be run first, etc.
Identical commands: a, aa, adda, algo
removefile <number> Removes file number <number> (see result of addfile or showconf) from the file list.
Identical commands: rf, rmvf, rmvfile
removealgo <number> Removes algorithm number <number> (see result of addalgo or showconf) from the list of algorithms.
Identical commands: ra, rmva, rmvalgo
clear Removes all added global variables, files and algorithms.
Identical command: clr
clearfiles Removes all added files.
Identical commands: cf, clrf, clrfiles
clearalgos Removes all added algorithms.
Identical commands: ca, clra, clralgos
clearvars Removes all added global variables.
Identical commands: cv, clrv, clrvars
showconf Shows the current configuration, which includes all file names, parsing algorithms and global variables. It also includes the file and algorithm numbers used by removefile and removealgo. During the parsing process, the list of algorithms will not be shown.
Identical commands: sc, conf, confg, showconfg, showconfig
load <name> Loads the configuration named <name> from its configuration file (conf/<name>.cfg).
Identical commands: l, o, ld, rd, open, read
save <name> Saves the current configuration to a configuration file so that it can be loaded by load <name>.
Identical commands: w, sv, wr, write
set <name>[=<value>] Sets the global variable <name> to <value>. If no value is given, an empty string will be set.
Identical command: st
unset <name> Unsets the global variable <name>.
Identical commands: u, us, uset
get <name> Gets the value of the global variable <name>.
Identical commands: g, gt
start Starts the parsing and quits the program when finished.
Identical commands: r, s, rn, run, strt
exit Quits the program without parsing.
Identical commands: c, e, q, x, cncl, quit, close, cancel
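
Two details of this list lend themselves to a short sketch: the identical commands and the optional value of set. The code below is an illustration under assumptions, not the program's own implementation. The function names are made up, and only a few of the identical commands listed above are included.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Maps some of the identical commands listed above to their main command.
// Words that are not in the table are returned unchanged.
std::string canonical_command(const std::string& word) {
    static const std::map<std::string, std::string> aliases = {
        {"?", "help"}, {"h", "help"}, {"man", "help"},
        {"st", "set"},
        {"u", "unset"}, {"us", "unset"}, {"uset", "unset"},
        {"g", "get"}, {"gt", "get"},
    };
    std::map<std::string, std::string>::const_iterator it = aliases.find(word);
    return it == aliases.end() ? word : it->second;
}

// Splits the argument of set into name and value at the first '='.
// Without '=', the value is an empty string, as described for the set command.
std::pair<std::string, std::string> split_assignment(const std::string& arg) {
    std::string::size_type eq = arg.find('=');
    if (eq == std::string::npos) return {arg, ""};
    return {arg.substr(0, eq), arg.substr(eq + 1)};
}
```

Splitting at the first '=' only means that a value may itself contain further '=' characters.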

5. Binary mode

If you want to parse binary files, you have to put the program into binary mode. Simply add a global variable named binary to your configuration, e.g. by using the set binary command. This variable will - like all other variables - also be saved in any configuration file created by the save command.

Binary mode was introduced in version 0.5a.
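
The rule "binary mode is on as soon as a variable named binary exists" can be sketched as follows. This is only an illustration: a std::map stands in for the global variables environment (the real vars class has its own interface), and it is an assumption that the mode ends up as an iostream open mode.

```cpp
#include <cassert>
#include <ios>
#include <map>
#include <string>

// Stand-in for the global variables environment (assumption, not the vars class).
typedef std::map<std::string, std::string> variable_map;

// Binary mode is on if a variable named "binary" exists, whatever its value.
bool binary_mode(const variable_map& vars) {
    return vars.find("binary") != vars.end();
}

// Chooses the mode for opening input files.
std::ios_base::openmode input_mode(const variable_map& vars) {
    std::ios_base::openmode mode = std::ios_base::in;
    if (binary_mode(vars)) mode |= std::ios_base::binary;
    return mode;
}
```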

6. Tutorial (simple example)

The following example parses an HTML file (located at input/example.htm), removes all HTML comments and keeps only the content of the <body> tag, which is written into the output file. It then uses the HTMLSpider parsing algorithm to check the file for links and processes all linked files in exactly the same way.

Note: The example will not work if you only got the minimum package of parsefile, unless you create your own input file at input/example.htm.

To reach our goal, we will use four different parsing algorithms:

1. html_remove_comments
Will remove all HTML comments so that HTML parsing can start.
HTML parsing algorithms usually don't check whether a tag they find is inside a comment. That is why html_remove_comments should be used before any other HTML parsing algorithm.

2. html_body
Leaves just the content of the <body> tag, the rest will be ignored.
We don't need more, because the links we are looking for are supposed to be inside the <body> tag.

3. print_content
For demonstration purposes, we will write the manipulated content into the output file(s).

4. HTMLSpider
After we're done, we let HTMLSpider check the content (which is now the content of the <body> tag only) for links.
The algorithm will automatically add files found in link tags, and the program will start the whole parsing process for them, too.

Setting up the example is easy. Start parsefile and enter the following commands (without the leading ": ", which only represents the command line prompt):

: addfile input/example.htm
: addalgo html_remove_comments
: addalgo html_body
: addalgo print_content
: addalgo html_spider

We only want to parse HTML files, so we add a filter to HTMLSpider by setting a global variable:

: set html_spider.filter=htm,html,php,php5,

Note: The trailing comma means that files without an extension will be parsed, too. Some websites, such as wikis, provide HTML files without extensions.
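
To make the meaning of the trailing comma concrete, here is a small sketch of how a comma-separated filter like this one could be evaluated. It is not the code of HTMLSpider. The function names are made up, and treating an empty filter as "no filter" is an assumption.

```cpp
#include <cassert>
#include <string>

// Returns the extension of a file name without the dot, or "" if there is none.
std::string file_extension(const std::string& name) {
    std::string::size_type slash = name.find_last_of('/');
    std::string::size_type dot = name.find_last_of('.');
    if (dot == std::string::npos) return "";
    if (slash != std::string::npos && dot < slash) return "";  // dot belongs to a directory
    return name.substr(dot + 1);
}

// Checks a file name against a filter such as "htm,html,php,php5,".
// An empty entry (here: after the trailing comma) matches files without extension.
bool passes_filter(const std::string& name, const std::string& filter) {
    if (filter.empty()) return true;  // assumption: empty filter means no filter
    const std::string ext = file_extension(name);
    std::string::size_type begin = 0;
    while (begin <= filter.size()) {
        std::string::size_type end = filter.find(',', begin);
        if (end == std::string::npos) end = filter.size();
        if (filter.compare(begin, end - begin, ext) == 0) return true;
        begin = end + 1;
    }
    return false;
}
```

In this sketch, the comparison is case-sensitive, so EXAMPLE.HTM would not pass the filter above.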

Add a comment to your new configuration by typing:

: set comment=Example configuration for README.txt

You can now enter the showconfig command to check the configuration you created. It should produce a similar result to the following, depending on the versions of your algorithms:

: showconfig

FILE LIST
=========
- #1 input/example.htm

ALGORITHM LIST
==============
- #1 "html_remove_comments", version 1.5b (BETA) by Ans
- #2 "html_body", version 1.5b (BETA) by Ans
- #3 "print_content", version 1.5 (RELEASE) by Ans
- #4 "HTMLSpider", version 1.5b (BETA) by Ans

VARIABLE LIST
=============
- "version" = "0.8a (ALPHA)"
- "version_id" = [data]
- "path" = "/<...>/parsefile"
- "html_spider.filter" = "htm,html,php,php5,"
- "comment" = "Example configuration for README.txt"

Note: The version variable is an internal read-only constant and contains the version of the main program. [data] represents binary data.

To save the configuration, you can enter save example. It will be written into the config file conf/example.cfg and can easily be loaded by load example another time, to run it again or change it.

: save example

Now the moment has come to let parsefile do what it is supposed to do: parse the file. Enter start!

: start

The output can be similar to this, depending on the HTML files in your input directory:

Okay, let's go...

Parsing will be done in TEXT mode.

Jumping to file #1: "input/example.htm".
Reading file...
Running "html_remove_comments"...
Running "html_body"...
Running "print_content"...
Running "HTMLSpider"...
[HTMLSpider] Global variable "html_spider.max_files_to_add" doesn't exist.
[HTMLSpider] Using the default value instead.
[HTMLSpider] Maximum number of files to add set to 1000.
[HTMLSpider] Filter set to "htm,html,php,php5,".
[HTMLSpider] Added file to list: "input/linktest.htm".
Saving output to "output/example.htm.out"...
File done: "input/example.htm".

Jumping to file #2: "input/linktest.htm".
Reading file...
Running "html_remove_comments"...
Running "html_body"...
Running "print_content"...
Running "HTMLSpider"...
[HTMLSpider] Added file to list: "input/test.htm".
[HTMLSpider] File already in list: "input/test.htm".
[HTMLSpider] File already in list: "input/test.htm".
[HTMLSpider] File already in list: "input/test.htm".
[HTMLSpider] File already in list: "input/test.htm".
Saving output to "output/linktest.htm.out"...
File done: "input/linktest.htm".

Jumping to file #3: "input/test.htm".
Reading file...
Running "html_remove_comments"...
Running "html_body"...
Running "print_content"...
Running "HTMLSpider"...
Saving output to "output/test.htm.out"...
File done: "input/test.htm".

All files done.

Finishing "html_remove_comments"...
Finishing "html_body"...
Finishing "print_content"...
Finishing "HTMLSpider"...

Thank you for using "parsefile" and have a nice day!

As the output shows, we could also have used the set command to set the global variable html_spider.max_files_to_add, which limits the number of files that HTMLSpider adds. The default value is 1,000 files.

In this example, HTMLSpider finds two additional files: example.htm contains a link to linktest.htm, which (in the version used here) contains five links, all to test.htm. Nevertheless, test.htm will only be parsed once.
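
The two rules at work here (a file is added only once, and only up to a maximum number of added files) can be sketched with a small stand-in class. It is not the filelist class of the main program. The class and its member names are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the list of files to parse.
class simple_file_list {
public:
    explicit simple_file_list(std::size_t max_added)
        : max_added_(max_added), added_(0) {}

    // Files given by the user; they do not count against the limit.
    void add_initial(const std::string& name) {
        if (!contains(name)) files_.push_back(name);
    }

    // Files found by an algorithm. Returns false if the file is already
    // in the list or if the limit of added files has been reached.
    bool add_found(const std::string& name) {
        if (contains(name) || added_ >= max_added_) return false;
        files_.push_back(name);
        ++added_;
        return true;
    }

    std::size_t size() const { return files_.size(); }

private:
    bool contains(const std::string& name) const {
        return std::find(files_.begin(), files_.end(), name) != files_.end();
    }

    std::vector<std::string> files_;
    std::size_t max_added_;
    std::size_t added_;
};
```

Fed with the links from the example run, such a list would accept linktest.htm and the first link to test.htm and reject the four repeated links, leaving three files in total.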

For all parsed files there should now be output files in the output folder. Open them with a text editor and you will find that all the comments have been removed (by html_remove_comments) and only the content of the <body> tag is left (by html_body). The copying of the resulting content into these output files was done by print_content.

Next time, you don't have to manually add the file and all the algorithms. Just use the load command:

: load example

7. Version history and outlook

History

0.1a - First alpha version. Can open local files and algorithms only. No download functionality yet.
0.2a - Global variables environment (vars class) has been added.
0.3a - Command line can be skipped by --start/-s argument.
0.4a - Updated global variables environment.
0.5a - Supports binary files.
0.6a - Bugfix and update of filelist and vars classes, change of configuration file format.
0.7a - Algorithms can only be added once, unless they allow multiple instances.
0.8a - Introduced new global system variables and changed the clearing order of the algorithms.

See Changelog for more detailed version information.

Outlook

1.0b - First beta version. Will be able to download files, but can use local algorithms only.
1.0 - First release version. Will be able to download files and algorithms.

Ports

No ports to operating systems other than Linux are planned, but feel free to port it yourself and tell the world about your awesome work. See also Bugs & Contact.

8. Changelog

Whenever you install a new version, check the changelog. New functionality may require you to change your algorithms or configuration files.

Changes from v0.1a to v0.2a

The global variables environment (vars class) has been added. It allows the user and the parsing algorithms to set, unset and read global variables.

New command line commands are: set, unset, get.
The command line command showconf has been extended to show all defined global variables.

Important change: The algorithm class has been changed. The parse function has a new parameter for the global variables environment. All parsing algorithms from v0.1a should be changed by adding the new parameter to the parse function.

See Creating new algorithms for more details.

Important change: The file format for the configuration files has been changed. It now includes the global variables. However, older configuration files (from v0.1a) are still supported.

Changes from v0.2a to v0.3a

Added the program arguments --start and -s to enable immediate starting of the loaded configuration file. No command line will be loaded. See Usage for more details.

Fixed a bug in the parsing of program arguments that occurred when the name of the configuration file included spaces.

Changes from v0.3a to v0.4a

The global variables environment (vars class) has been made safer by adding an array that records whether each variable is a string, so that non-string variables can be skipped during string-only operations. Previously, these operations were no longer possible once non-string variables had been added.

Important change: The algorithm class has been changed. The init function has a new parameter for the global variables environment. All parsing algorithms should be changed by adding the new parameter to the init function.

See Creating new algorithms for more details.

For compatibility reasons, the very same argument passed to the parse function remains.

Changes from v0.4a to v0.5a

Support for reading binary files has been added. The output file will be written in binary mode, supporting binary output files, too. The compatibility of algorithms will now be checked by introducing version constants (in version.h) and the new get_program_version function for algorithms (see below).

If you want to use the new binary mode (all input files will be handled as binary files), you have to set the global variable binary to any value, e.g. by using the command set binary, see Binary mode for details.

Important change: The algorithm class has been changed. The init and parse functions have new parameters for the size of the used buffers. Algorithms that write into the output/content buffers have to use these new arguments for providing the size of their data, because binary data is now supported and the strlen function cannot be used anymore on the buffers.

Important change: The algorithm class has been changed. The parse function doesn't include any argument for the global variables environment anymore. If needed in this function, the pointer has to be saved by the init function.

Important change: The algorithm class has been changed. For compatibility checks, every algorithm has to include the get_program_version function to return the used version of the main program. The version constants to be used can be found in the version.h header file of the main program.

All parsing algorithms from previous versions need to be changed by updating their member functions according to these changes. See Creating new algorithms for more information.

Important change: The vars class has been changed. A major bug in the set functions has been fixed. Another bug in the save_strings function, which could lead to incomplete configuration files, has also been fixed. Most of the functions now return a boolean value for error handling and are able to handle memory shortages. All parsing algorithms need to use the new vars class.

Important change: The filelist class has been changed. It now handles relative path names better, which will later be needed for HTTP support. Error handling has been added to the save and open functions. Also, the add_file function of the class can now tell whether a file already existed and was therefore not added. All parsing algorithms should use the new filelist class.

The parser class has been changed. Error handling has been added to the write_conf and read_conf functions. The read_conf function is now able to handle memory shortages.

The set command now doesn't need a value anymore. If used without value, it will set the variable to an empty string.

Changes from v0.5a to v0.6a

Important change: The algorithm class has been changed. Use of long instead of int. All algorithms need to be changed, for details see Creating new algorithms.

Important change: The filelist class has been changed. A major bug in the add_file function has been fixed. The function also has a new parameter for the name of the algorithm that adds the file. In URLs, it now ignores everything after the # character. Like the algorithm class, the class now uses long instead of int. All algorithms using this class should be updated.

Important change: The vars class has been changed. Bugs in the open and add_alloc functions have been removed. Like the "algorithm" class, the class now uses long instead of int. All algorithms using this class need to be updated.

Important change: The file format of the configuration files has been changed. Global variables will be read before the parser configuration (used algorithms), so that the variables will affect the added algorithms correctly. Older configuration files are not supported anymore. See Reference: Configuration file formats for more information.

The new console command clearvars removes all added global variables. The clear console command has been updated to remove all added global variables, too.

Changes from v0.6a to v0.7a

Important change: The algorithm class has been changed. A new member function multiple_instances has been added, which returns whether multiple instances of an algorithm are allowed. All algorithms need to be changed by adding this function. For details, see Creating new algorithms.

Changes from v0.7a to v0.8a

The global variable path now represents the path of the program and can be changed. Make sure that it points to the main directory of the program.

The constant version_id in the global variables environment now contains an ID of the current version of parsefile. It is stored as an unsigned short and can be used by parsing algorithms.

Parsing algorithms can now skip the rest of the current parsing process for one file by setting the global variable skip.

Important change: The vars class has been changed to handle the new global system variable path and a major bug in the set_extern / add_nalloc functions has been removed. All algorithms using this class should be updated.

Small bugs in reading the last line of configuration files and saving the file list have been removed.
A bug in the removefile and removealgo commands has been removed.
A bug in the open function of the filelist class has been removed.
A bug in the add_algorithm function of the parser class has been removed.

Important change: The filelist and parser classes have been improved. All algorithms using these classes should be updated.

The algorithms now finish backwards, the last algorithm added is the first to be cleared, etc.

You can find the changelogs of the algorithms inside their header files (*.h) located in algos/src/.

9. Add-ons

The program is easily extendable.

To add a new parsing algorithm, copy the shared library (*.so) file and (optionally) the description (*.txt) file of the new algorithm into the algos folder.

To add a new configuration, copy the configuration (*.cfg) file into the conf folder. Make sure that you have added all necessary algorithms before loading the configuration file, otherwise the load command will fail.

10. Compiling algorithms

The included algorithms are already compiled into shared library (*.so) files.

If you want to recompile them (or compile new ones), you can use the g++ compiler, which has to be called twice for every algorithm. The following examples should be run from inside the algos/src folder.

step #1: Create object file

g++ -Wall -fPIC -c <source code> [additional sources used]

This will create an object (*.o) file which has to be converted to a shared library (*.so) file.

Note: [additional sources used] can also be source code of the main program, for example filelist.cpp if the algorithm uses the filelist class to manipulate the list of files to be parsed, or vars.cpp if the algorithm uses the vars class to access the global variables environment (HTMLSpider, for example, uses both). Inside the algos/src folder, the path to the main program source (and header) files is ../../src/.

step #2: Create shared library

g++ -shared -o ../<name>.so <object file> [additional object files used]

This will save the shared library <name>.so in the directory above (algos).

Depending on the sources used in step #1, add the additional object files as needed, e.g. add filelist.o and/or vars.o if you want to use the filelist and/or vars classes of the main program and therefore created, in step #1, object files from the ../../src/filelist.cpp and/or ../../src/vars.cpp source code files.

The included default parsing algorithms provide a compile.sh, which is located in algos/src. Check out the source and run it from within this folder by opening the terminal in algos/src and typing sh compile.sh. It performs exactly step #1 and step #2 for all included default algorithms.

11. Creating new algorithms

You can easily add your own algorithms by creating a child class of the algorithm class declared in src/algorithm.h.

The class has to have the following functions:

bool init(char ** output, unsigned long * output_size, void * v)

Will be called when the algorithm is added to a file.

The output argument points to the output buffer, but *output will be NULL if no output has been created yet. Use the malloc/calloc functions to create the output buffer; it will be freed by the main program. Make sure to free any existing output buffer first by using the free function. In that case, you should ideally include the existing output in your new output.

The output_size argument points to a variable holding the size of the output buffer. If you change the output you have to change the value of this variable accordingly.

At the end of the parsing process, the output buffer (if it exists) will be written to the output file. Both text and binary output are supported since version 0.5a.

Note that an algorithm may be added to a file and removed again without ever parsing it. If you already use the output buffer in the init function, you should handle this case by saving the old output (if any) and restoring it if the clear function is called without a preceding call to the parse function.

Usually you will not need to use the output buffer in the init function already.

The v argument points to the vars object used for the global variables environment. You only need it if you want to set, unset or read global variables set by the user or by other algorithms. If you want to use the vars object, you have to include its header file (../../src/vars.h). You also need to add the source code of the vars object (../../src/vars.cpp) when compiling the algorithm. See Compiling algorithms for more information.

If you want to use global variables in your parse function, you have to save the pointer submitted to the init function in a variable inside the source code of your algorithm.

Please name your variables in the following format: <algoname>.<variable>. For example, see the variables used by the HTMLSpider algorithm:

html_spider.max_files_to_add
html_spider.tmp_added_files

Return true if the algorithm was loaded successfully or if you don't need to load anything.
Returning false means that loading the algorithm failed and that it cannot be used for parsing the file.

bool parse(const char * fname, char ** content, unsigned long * content_size, char ** output, unsigned long * output_size, void * flist)

Will be called to parse the content of the file.

The content argument will point to the content buffer, containing the file content to parse. Make sure that you free the memory of the old content buffer by using the free function if you intend to write content that could be larger than the old one.

Use the malloc/calloc functions to create a larger content buffer; the program will free it automatically at the end.

The content_size argument points to a variable holding the size of the current content. If you change the content you have to change the value of this variable accordingly.

The output argument points to the output buffer, but *output will be NULL if no output has been created yet. Use the malloc/calloc functions to create the output buffer; it will be freed by the main program. Make sure to free any existing output buffer first by using the free function. In that case, you should ideally include the existing output in your new output.

The output_size argument points to a variable holding the size of the output buffer. If you change the output you have to change the value of this variable accordingly.

At the end of the parsing process, the output buffer (if it exists) will be written to the output file. Both text and binary output are supported (since version 0.5a).

The fname argument contains the file name as a string. Usually you don't need it, because the file has already been opened and read by the main program. Use the content buffer instead to make sure that you parse the current content, not the original file. However, the fname argument can be useful for documentation purposes or if you need to find out the path of the file.

Note: The file name can be local (without directories or with sub-directories only) or global (complete path, URL). In principle, the program supports URLs starting with http://, https://, ftp:// and file:// (to be implemented by version 1.0b).

The flist argument points to the filelist object used for the list of files to parse. You only need it if you want to manipulate the list, e.g. to add further files to be parsed at the end of the process. If you want to use the filelist object, you have to include its header file (../../src/filelist.h). You also need to add the source code of the filelist object (../../src/filelist.cpp) when you compile your algorithm. See Compiling algorithms for more information.

Return true if the parsing was successful.

Returning false indicates an error but does not stop the rest of the parsing process. This can lead to unexpected results, which you should warn the user about.
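
The buffer rules described for the parse function (create buffers with malloc/calloc, free the old buffer yourself, keep the size variables up to date, keep existing output) can be sketched with two helper functions. They are not part of parsefile. Their names are made up, and they only illustrate one way of following these rules. Sizes are passed explicitly, so binary data containing zero bytes is handled, too.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Replaces *content with a copy of data (size bytes) and updates *content_size.
// The copy is made before the old buffer is freed, so data may point into
// the old content. Returns false on memory shortage.
bool replace_content(char** content, unsigned long* content_size,
                     const char* data, unsigned long size) {
    char* fresh = static_cast<char*>(std::malloc(size > 0 ? size : 1));
    if (fresh == NULL) return false;
    if (size > 0) std::memcpy(fresh, data, size);
    std::free(*content);
    *content = fresh;
    *content_size = size;
    return true;
}

// Appends data to the output buffer and keeps the output that earlier
// algorithms may have written. *output may be NULL if there is none yet.
bool append_output(char** output, unsigned long* output_size,
                   const char* data, unsigned long size) {
    const unsigned long old_size = (*output != NULL) ? *output_size : 0;
    const unsigned long total = old_size + size;
    char* fresh = static_cast<char*>(std::malloc(total > 0 ? total : 1));
    if (fresh == NULL) return false;
    if (old_size > 0) std::memcpy(fresh, *output, old_size);
    if (size > 0) std::memcpy(fresh + old_size, data, size);
    std::free(*output);
    *output = fresh;
    *output_size = total;
    return true;
}
```

Inside a parse function, an algorithm could call append_output(output, output_size, *content, *content_size) to copy the current content into the output, which is roughly what the description of print_content in the tutorial suggests.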

void clear()

Will be called when the algorithm is removed from the file or when the parsing process is either cancelled or finished. You have to free the memory you used here, except for the content and output buffers, which will be freed by the program.

const char * get_name() const
const char * get_version() const
const char * get_author() const

These three functions should return static strings containing name, version and author of the algorithm. They will be used by the main program to show more information about the added algorithm.

unsigned int get_program_version() const

This function has to return the version of the main program for which your algorithm has been written. Older versions of the main program will not be able to use your algorithm. If the version of the main program is higher than the version used by you, a warning message will be shown.

Make sure to update your algorithm with each new version of the main program. Check the Changelog section of the readme file to learn about possible changes to make to your algorithm.

The version constants that you should use are defined in the version.h header file of the main program, which should be included. You can also use the PARSEFILE_VER_CURRENT constant, which will be automatically updated in every new version of the main program, provided you don't use a copy of an outdated header file.

Example: If your algorithm source code is located inside the program's algos/src folder, use

#include "../../src/version.h"

to include the original (up-to-date) header file.
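
The compatibility rule described above can be written down as a small comparison. This is a sketch, not the check performed by the main program. It assumes that the version constants from version.h are numbers that grow with each new version, and the names used are made up.

```cpp
#include <cassert>

// Possible results of the compatibility check (hypothetical names).
enum compatibility { compatible, compatible_with_warning, incompatible };

// program_version: version of the running main program.
// algorithm_version: value returned by get_program_version() of the algorithm.
compatibility check_version(unsigned int program_version,
                            unsigned int algorithm_version) {
    if (program_version < algorithm_version) return incompatible;            // program too old
    if (program_version > algorithm_version) return compatible_with_warning; // algorithm older
    return compatible;
}
```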

bool multiple_instances() const

This function has to return whether multiple instances of the algorithm are allowed.

Note that all instances have access to the same global variables environment. It is not possible to define different variables with the same name for different instances of your algorithm.
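
Put together, the member functions described in this section could look like the following skeleton. It is a sketch only: the base class declared here is a stand-in built from the signatures given above (the real one is in src/algorithm.h and may differ), EXAMPLE_PROGRAM_VERSION is a made-up placeholder for a constant from version.h, and do_nothing is a hypothetical algorithm that leaves content and output untouched.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the base class from src/algorithm.h (assumption).
class algorithm {
public:
    virtual ~algorithm() {}
    virtual bool init(char** output, unsigned long* output_size, void* v) = 0;
    virtual bool parse(const char* fname, char** content, unsigned long* content_size,
                       char** output, unsigned long* output_size, void* flist) = 0;
    virtual void clear() = 0;
    virtual const char* get_name() const = 0;
    virtual const char* get_version() const = 0;
    virtual const char* get_author() const = 0;
    virtual unsigned int get_program_version() const = 0;
    virtual bool multiple_instances() const = 0;
};

// Placeholder for a version constant such as PARSEFILE_VER_CURRENT.
// The value is made up for this sketch.
const unsigned int EXAMPLE_PROGRAM_VERSION = 0;

// Hypothetical algorithm that changes nothing.
class do_nothing : public algorithm {
public:
    do_nothing() : vars_(NULL) {}

    // Keeps the pointer to the global variables for later use in parse().
    bool init(char**, unsigned long*, void* v) { vars_ = v; return true; }

    bool parse(const char*, char**, unsigned long*,
               char**, unsigned long*, void*) { return true; }

    // Nothing was allocated, so there is nothing to free.
    void clear() { vars_ = NULL; }

    const char* get_name() const { return "do_nothing"; }
    const char* get_version() const { return "0.1"; }
    const char* get_author() const { return "example"; }
    unsigned int get_program_version() const { return EXAMPLE_PROGRAM_VERSION; }
    bool multiple_instances() const { return true; }

private:
    void* vars_;
};
```

In a real algorithm, you would include ../../src/algorithm.h and ../../src/version.h instead of declaring the stand-ins yourself.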

You also need to include construction and destruction functions in the source code of your class. They are needed by the main program to load your class and have to be called create and destroy.

The create function creates an instance of the class and returns it as a pointer to the algorithm class.
The destroy function deletes the created instance of the class.

Just copy the following code to the end of the source file of your class and replace <classname> with the name of your class:

// creation and destruction "C" functions for dynamic loading of the class
extern "C" {
algorithm* create() { return new <classname>; }
void destroy(algorithm * p) { delete p; }
}

If you want to write output to the console (stdout), please add a tag with the algorithm name in front, like:

printf("[<algoname>] This is output by the algorithm named <algoname>.");

The best way to find out more about programming parsing algorithms for parsefile is to check out the source code of the existing algorithms, which is located in the algos/src folder (if received).

To start a new algorithm, feel free to use the template files in algos/src/templates. Copy them into the algos/src folder, rename them to the name of your algorithm and open them to add your own code. To do so, follow the instructions inside the newly created files.

For information on how to compile your algorithms, see Compiling algorithms.

All your algorithms have to be licensed under the same GNU license as the program, see License.

12. Developed and tested with

Developed and tested with:

[v0.1a-v0.5a] g++ 4.5.2 on Ubuntu 11.04 (natty), Kernel Linux 2.6.38-11 generic, GNOME 2.32.1
[since v0.6a] g++ 4.6.1 on Ubuntu 11.10 (oneiric), Kernel Linux 3.0.0-12 generic, GNOME 3.2.0
[since v0.8a] g++ 4.6.3 on Ubuntu 12.04 (precise), Kernel Linux 3.2.0-38 generic, GNOME 3.4.2

No development software suite has been used due to the simplicity of the program. It can be discovered and extended by using a text editor (e.g. gedit) and the Ubuntu (or other Linux) command line tools.

13. Bugs & Contact

The program website can be found at http://www.ghstyle.de/parsefile/. Contact ans(at)ghstyle.de for bug reports, questions and remarks. Please choose a meaningful e-mail subject so that your message is not mistaken for spam.

Note that the program was mainly written for my own use and that no detailed compatibility support can be given. To receive support for non-default parsing algorithms, contact the author(s) of the respective algorithm.

Please contact me if you make improvements or extensions to the main program or develop new parsing algorithms, so that these changes can be included in the original program package. This way, others will be able to reuse them and you will receive the maximum credit for your work. Always remember: Sharing is caring!

Back to top

14. License

Copyright (C) 2011 by Anselm Schmidt, www.ghstyle.de.

parsefile is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

parsefile is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.

Back to top

15. Reference: Configuration file formats

Version 1

This version of the configuration file format is outdated. It was used from parsefile v0.1a to v0.5a. Its problem was that the global variables were loaded after the algorithms, so they could not influence the initialization of the saved algorithms. A version 1 file can be updated manually to version 2 by swapping the part for global variables with the part for algorithms and changing the tag in the first line to [PFCFF2].

The format is in plain text and has the following structure:

[PFCFF]
<Main program version>
<Part for file list>
<Part for algorithms>
<Part for global variables>

The tag in the first line ([PFCFF]) stands for "parsefile configuration file format", version 1. The other parts are identical to those of version 2 (see below); only their order differs.

The last part was added in parsefile v0.2a, which introduced the global variables environment.

Here is the configuration file conf/example.cfg created in the tutorial (see Tutorial) with an old version of parsefile, in this case version 0.4a:

[PFCFF]
ver=0.4a (ALPHA)
1
input/example.htm
4
algos/html_remove_comments.so
algos/html_body.so
algos/print_content.so
algos/html_spider.so
1
comment
Example configuration for README.txt
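
The manual update described above can also be sketched in code. The following C++ snippet is only an illustration and not part of parsefile: the function names read_part, write_part and convert_v1_to_v2 are hypothetical. It assumes that every part starts with a count line as described under Version 2 below, and that a file written by v0.1a (which has no part for global variables) can be given an empty part with the count 0.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read one counted part: a count line N followed by N * lines_per_entry lines.
// All lines of the part, including the count line, are returned unchanged.
std::vector<std::string> read_part(std::istream& in, std::size_t lines_per_entry = 1) {
    std::vector<std::string> part;
    std::string line;
    if (!std::getline(in, line)) throw std::runtime_error("missing count line");
    part.push_back(line);
    const std::size_t total = std::stoul(line) * lines_per_entry;
    for (std::size_t i = 0; i < total; ++i) {
        if (!std::getline(in, line)) throw std::runtime_error("part is shorter than its count");
        part.push_back(line);
    }
    return part;
}

// Write the lines of one part, each terminated by a newline.
void write_part(std::ostream& out, const std::vector<std::string>& part) {
    for (const std::string& line : part) out << line << '\n';
}

// Convert the text of a version 1 file ([PFCFF]) into version 2 ([PFCFF2]).
std::string convert_v1_to_v2(const std::string& v1_text) {
    std::istringstream in(v1_text);
    std::string tag, version;
    if (!std::getline(in, tag) || tag != "[PFCFF]")
        throw std::runtime_error("not a version 1 configuration file");
    if (!std::getline(in, version))
        throw std::runtime_error("missing version line");

    // Version 1 order: file list, algorithms, global variables.
    const std::vector<std::string> files = read_part(in);
    const std::vector<std::string> algorithms = read_part(in);

    // Assumption: if the part for global variables is missing, use an empty one.
    std::vector<std::string> variables(1, "0");
    if (in.peek() != std::char_traits<char>::eof())
        variables = read_part(in, 2);  // each variable takes two lines (name, value)

    // Version 2 order: file list, global variables, algorithms.
    std::ostringstream out;
    out << "[PFCFF2]\n" << version << '\n';
    write_part(out, files);
    write_part(out, variables);
    write_part(out, algorithms);
    return out.str();
}
```

The version line is copied unchanged, because the manual update only changes the tag and the order of the parts. Applied to the example above, the sketch moves the two lines of the comment variable (together with their count line) in front of the four algorithm lines.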

Version 2

This is the current version of the configuration file format, used since parsefile v0.6a.

The format is in plain text and has the following structure:

[PFCFF2]
<Main program version>
<Part for file list>
<Part for global variables>
<Part for algorithms>

The tag in the first line ([PFCFF2]) stands for "parsefile configuration file format", version 2. The other parts are structured as follows:

<Main program version>: Single line starting with ver= and ending with the version of parsefile that created the configuration file. Example: ver=0.6a (ALPHA)

<Part for file list>: The first line contains the number of files in the file list (N), followed by N lines, each of them containing a file name. Empty lines are possible; they mark files that have been removed.

<Part for global variables>: The first line contains the number of global variables (N), followed by N*2 lines that form N pairs, each pair representing one variable. The first line of a pair contains the name of the variable, the second line its value. Empty names are possible; they mark variables that have been removed, and the values of such pairs are ignored. Empty values represent empty variables.

<Part for algorithms>: The first line contains the number of algorithms to add (N), followed by N lines, each of them containing the path of an algorithm library to be loaded (e.g. algos/html_spider.so). Empty lines are possible; they mark algorithms that have been removed.

Here is the configuration file conf/example.cfg created in the tutorial (see Tutorial) with a version of parsefile that writes this format, in this case version 0.6a:

[PFCFF2]
ver=0.6a (ALPHA)
1
input/example.htm
2
html_spider.filter
htm,html,php,php5,
comment
Example configuration for README.txt
4
algos/html_remove_comments.so
algos/html_body.so
algos/print_content.so
algos/html_spider.so
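
To make the structure above more concrete, here is a minimal C++ sketch of a reader for version 2 files. It is not the code used by parsefile itself: the Config structure and the functions read_line, read_count and parse_config_v2 are hypothetical names chosen for this illustration. The sketch follows the description above and skips entries that are marked as removed.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of a version 2 configuration file.
struct Config {
    std::string version;                                         // text after "ver="
    std::vector<std::string> files;                              // file list
    std::vector<std::pair<std::string, std::string>> variables;  // (name, value) pairs
    std::vector<std::string> algorithms;                         // library paths
};

// Read one line or throw if the input ends too early.
std::string read_line(std::istream& in, const std::string& what) {
    std::string line;
    if (!std::getline(in, line)) throw std::runtime_error("missing " + what);
    return line;
}

// Read the count line (N) that starts each part.
std::size_t read_count(std::istream& in, const std::string& what) {
    return std::stoul(read_line(in, what));
}

// Parse the text of a version 2 configuration file.
Config parse_config_v2(const std::string& text) {
    std::istringstream in(text);
    Config cfg;

    if (read_line(in, "format tag") != "[PFCFF2]")
        throw std::runtime_error("not a version 2 configuration file");

    const std::string ver = read_line(in, "version line");
    if (ver.compare(0, 4, "ver=") != 0)
        throw std::runtime_error("version line does not start with ver=");
    cfg.version = ver.substr(4);

    // File list: N lines, empty lines mark removed files.
    for (std::size_t n = read_count(in, "file count"); n > 0; --n) {
        const std::string file = read_line(in, "file name");
        if (!file.empty()) cfg.files.push_back(file);
    }

    // Global variables: N pairs of lines, pairs with an empty name are ignored.
    for (std::size_t n = read_count(in, "variable count"); n > 0; --n) {
        const std::string name = read_line(in, "variable name");
        const std::string value = read_line(in, "variable value");
        if (!name.empty()) cfg.variables.emplace_back(name, value);
    }

    // Algorithms: N lines, empty lines mark removed algorithms.
    for (std::size_t n = read_count(in, "algorithm count"); n > 0; --n) {
        const std::string path = read_line(in, "algorithm path");
        if (!path.empty()) cfg.algorithms.push_back(path);
    }
    return cfg;
}
```

For the example file above, such a reader would return one file, two global variables (html_spider.filter and comment) and four algorithm paths. Because the global variables are read before the algorithms, they are already available when the algorithms are loaded, which is the reason for the changed order in version 2.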

Back to top

16. Appendix: File lists

There are two packages available at the moment: The full package includes the source code of the program and the default parsing algorithms as well as some sample files. The minimal package contains the binary files only.

Full package

parsefile - executable program binary
README.txt - readme file
algos/ - folder for parsing algorithms
algos/command_line.so - the command_line parsing algorithm
algos/command_line.txt - short description of the command_line parsing algorithm
algos/count_strings.so - the count_strings parsing algorithm
algos/count_strings.txt - short description of the count_strings parsing algorithm
algos/count_strings_readme.txt - readme file for the count_strings parsing algorithm
algos/csv.so - the csv parsing algorithm
algos/csv.txt - short description of the csv parsing algorithm
algos/html_body.so - the html_body parsing algorithm
algos/html_body.txt - short description of the html_body parsing algorithm
algos/html_remove_comments.so - the html_remove_comments parsing algorithm
algos/html_remove_comments.txt - short description of the html_remove_comments parsing algorithm
algos/html_remove_tags.so - the html_remove_tags parsing algorithm
algos/html_remove_tags.txt - short description of the html_remove_tags parsing algorithm
algos/html_spider.so - the HTMLSpider parsing algorithm
algos/html_spider.txt - short description of the HTMLSpider parsing algorithm
algos/print_binary.so - the print_binary parsing algorithm
algos/print_binary.txt - short description of the print_binary parsing algorithm
algos/print_content.so - the print_content parsing algorithm
algos/print_content.txt - short description of the print_content parsing algorithm
algos/show_content.so - the show_content parsing algorithm
algos/show_content.txt - short description of the show_content parsing algorithm
algos/wait.so - the wait parsing algorithm
algos/wait.txt - short description of the wait parsing algorithm
algos/src/ - folder for the source code of the parsing algorithms
algos/src/command_line.cpp - source code file of the command_line parsing algorithm
algos/src/command_line.h - header file of the command_line parsing algorithm, including changelog
algos/src/compile.sh - shell file to compile the default parsing algorithms
algos/src/count_strings.cpp - source code file of the count_strings parsing algorithm
algos/src/count_strings.h - header file of the count_strings parsing algorithm, including changelog
algos/src/csv.cpp - source code file of the csv parsing algorithm
algos/src/csv.h - header file of the csv parsing algorithm, including changelog
algos/src/csv_data.h - header file of the csv_data structure used for storing CSV data in memory
algos/src/html_body.cpp - source code file of the html_body parsing algorithm
algos/src/html_body.h - header file of the html_body parsing algorithm, including changelog
algos/src/html_remove_comments.cpp - source code file of the html_remove_comments parsing algorithm
algos/src/html_remove_comments.h - header file of the html_remove_comments parsing algorithm, including changelog
algos/src/html_remove_tags.cpp - source code file of the html_remove_tags parsing algorithm
algos/src/html_remove_tags.h - header file of the html_remove_tags parsing algorithm, including changelog
algos/src/html_spider.cpp - source code file of the HTMLSpider parsing algorithm
algos/src/html_spider.h - header file of the HTMLSpider parsing algorithm, including changelog
algos/src/pfc_data.h - header file of the pfc_data structure used for storing PFC data in memory
algos/src/print_binary.cpp - source code file of the print_binary parsing algorithm
algos/src/print_binary.h - header file of the print_binary parsing algorithm, including changelog
algos/src/print_content.cpp - source code file of the print_content parsing algorithm
algos/src/print_content.h - header file of the print_content parsing algorithm, including changelog
algos/src/show_content.cpp - source code file of the show_content parsing algorithm
algos/src/show_content.h - header file of the show_content parsing algorithm, including changelog
algos/src/wait.cpp - source code file of the wait parsing algorithm
algos/src/wait.h - header file of the wait parsing algorithm, including changelog
algos/src/templates/ - templates for the development of new algorithms
algos/src/templates/algo.h - template of the header file of a new algorithm
algos/src/templates/algo.cpp - template of the source file of a new algorithm
conf/ - folder for configuration files
conf/example.cfg - configuration file as created by the tutorial in this readme file
conf/count_strings_example.cfg - configuration file as created by the tutorial in the count_strings readme
input/ - folder for sample input files
input/example.htm - example HTML file used by the tutorial in this readme file
input/linktest.htm - example HTML file indirectly used by the tutorial in this readme file
input/test.htm - example HTML file indirectly used by the tutorial in this readme file
output/ - folder for output files
src/ - folder for the source code of the main program
src/algorithm.h - header file of the algorithm class used as parent class for the single parsing algorithms
src/commands.cpp - source code of the command line command functions
src/commands.h - header file of the command line command functions
src/compile.sh - shell file to compile the main program
src/filelist.cpp - source code of the filelist class used for the list of files to be parsed
src/filelist.h - header file of the filelist class used for the list of files to be parsed
src/functions.cpp - source code of different helper functions used by the program
src/functions.h - header file of different helper functions used by the program
src/main.cpp - source code of the main program including main function (program entry point)
src/parser.cpp - source code of the parser class used for adding algorithms and parsing files
src/parser.h - header file of the parser class used for adding algorithms and parsing files
src/vars.cpp - source code of the vars class used for managing the global variable environment
src/vars.h - header file of the vars class used for managing the global variable environment
src/version.h - header file containing version constants and the current version of the program
tmp/ - folder for temporary files

Minimal package

parsefile - executable program binary
README.txt - readme file
algos/ - folder for parsing algorithms
algos/command_line.so - the command_line parsing algorithm
algos/count_strings.so - the count_strings parsing algorithm
algos/csv.so - the csv parsing algorithm
algos/html_body.so - the html_body parsing algorithm
algos/html_remove_comments.so - the html_remove_comments parsing algorithm
algos/html_remove_tags.so - the html_remove_tags parsing algorithm
algos/html_spider.so - the HTMLSpider parsing algorithm
algos/print_binary.so - the print_binary parsing algorithm
algos/print_content.so - the print_content parsing algorithm
algos/show_content.so - the show_content parsing algorithm
algos/wait.so - the wait parsing algorithm
conf/ - folder for configuration files
output/ - folder for output files
tmp/ - folder for temporary files

Back to top


Version: 0.8a, last change: 03/03/2013