2008 02 Syncing It Syncing a Libferris Filesystem with an Xml File or Database


Syncing a libferris filesystem with an XML file or database
Syncing It
With libferris, FUSE, and rsync, you can synchronize a filesystem with a dissimilar data source.
By Ben Martin
Admins use rsync to synchronize two filesystem trees. With a few tricks, you can use FUSE and libferris with
rsync [1][2][3] to synchronize a filesystem with another data source, such as an XML file or a PostgreSQL
database. Libferris is a user-space virtual filesystem (VFS) that lets you mount almost any data
source as a filesystem. Examples of data sources libferris can mount include XML files, Berkeley db4 files,
RPM packages, relational databases, LDAP servers, web servers, and applications like the X Window System, Emacs,
xmms, Amarok, and Firefox.
Libferris also includes evolving support for mounting web services. For example, you can interface a libferris
directory with a photo-sharing website like 23hq or Flickr. In this article, I will discuss some of the
possibilities for using rsync to synchronize a libferris filesystem with an XML file or database.
The ferrisfs application lets you expose libferris filesystems through FUSE. In its most basic form, ferrisfs
requires two arguments: the URL of the libferris filesystem to expose, passed with --url (or -u), and, as the
last argument, the mount point where the FUSE filesystem should appear in your Linux filesystem tree. Normally,
I create a fuse subdirectory in my home directory where all my FUSE mount points appear.
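The mount step can be wrapped in a small helper. The following is only a sketch: the function name mount_ferris and the DRY_RUN variable are invented for illustration, and the only ferrisfs option it relies on is --url, as described above. With DRY_RUN=1, it prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of libferris): expose a libferris URL
# under ~/fuse/<name> through ferrisfs.
mount_ferris() {
    url=$1
    name=$2
    mountpoint="$HOME/fuse/$name"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show what would be executed.
        echo "ferrisfs --url $url $mountpoint"
        return 0
    fi
    mkdir -p "$mountpoint" || return 1
    ferrisfs --url "$url" "$mountpoint"
}
```

A call such as mount_ferris ~/fuse/simple-xml.xml/simple-xml simple-xml would then create ~/fuse/simple-xml if needed and mount the XML element there.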
Metadata and Search
Apart from mounting miscellaneous data sources, the other two goals of libferris are metadata handling and
filesystem search.
Libferris comes with support for automatic metadata extraction and lets you add explicit metadata to any file
on any filesystem regardless of the user's write permission.
As an example of libferris' metadata capability, consider adding a handy tag to a file on an FTP server in
libferris for later identification. Even if the user does not have write access to the FTP server, libferris will
store the metadata in Resource Description Framework (RDF) to associate the tag with the file. On the other
hand, if you add a metadata tag to a file in a home directory, libferris will store the metadata in a kernel
extended attribute to give non-libferris applications access via the attr(1) interface.
Metadata extraction in libferris covers simple cases such as extracting the dimensions and Exif data of image
files, as well as more advanced cases. For example, if you tag files in the F-Spot photo management tool, you
can then access those tags using libferris.
Filesystem search support in libferris allows you to create multiple filesystem indexes.
Plugins are used to let you build indexes using PostgreSQL, Lucene, Xapian, and other tools. You can even
link indexes together to create a federation.
Recent versions support using libferris through FUSE, giving unmodified applications direct access to
anything libferris sees as a filesystem.
Steps
Listing 1 shows some of the steps for setting up an interaction with a libferris-backed FUSE filesystem. First,
a very basic XML file is created and mounted at ~/fuse/simple-xml.
Listing 1: FUSE Interaction on a Mounted XML File
$ cat simple-xml.xml
[XML markup lost]
$ mkdir simple-xml
$ ferrisfs --url ~/fuse/simple-xml.xml/simple-xml \
    simple-xml
$ ll simple-xml
total 0
-rwx------ 0 ferristester ferristester 0 Jan 1 1970 something*
$ date >| simple-xml/something
$ cat simple-xml.xml
[XML markup partly lost; surviving fragments:]
Tue May 22 22:48:57 EST 2007

Notice that the --url parameter selects the first element in the XML file as the libferris filesystem (instead of
the XML file itself).
XML files must have a single root element; by mounting that root element instead of the XML file, you avoid
exposing this detail to the applications using the FUSE filesystem.
Normal filesystem metadata is mirrored in the XML file using XML attributes. By updating the contents of a
file under the FUSE mount point, libferris both updates the contents of the XML element and records the
modification time in an XML attribute.
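The mtime attribute holds seconds since the Unix epoch, with values such as the 1179838179 that the later listings show for datefile1.txt. As a quick check, such a value can be turned back into a readable timestamp. This sketch assumes GNU date (the -d @SECONDS form); the function name epoch_to_utc is invented for the example.

```shell
#!/bin/sh
# Convert an epoch-seconds mtime into a UTC timestamp.
# Assumes GNU date, which accepts "-d @SECONDS".
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

epoch_to_utc 1179838179
```

This prints 2007-05-22 12:49:39, which is the 22:49:39 EST that the listings show for the same file, ten hours ahead of UTC.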
Listing 2 shows rsync on a libferris-backed FUSE filesystem. First, the source-native-fs directory is created
and populated with some simple test files. Other than the use of the --temp-dir (-T) command-line option, the
command looks like any other invocation of rsync.
Listing 2: Rsync to XML
$ mkdir source-native-fs
$ cd source-native-fs
$ date >datefile1.txt
$ date >datefile2.txt
$ touch emptyA
$ echo -n "hi there" > main
$ cd ~/fuse
$ mkdir ~/fuse/rsync-junk
$ rsync -avz -T ~/fuse/rsync-junk \
    source-native-fs/ simple-xml/
$ cat simple-xml.xml
[XML markup partly lost; surviving fragments:]
mtime="1179838199">
>Tue May 22 22:48:57 EST 2007
mtime="1179838179">Tue May 22 22:49:39 EST 2007
...
mtime="1179838199">hi there
$ rsync -avz --delete-after \
    -T ~/fuse/rsync-junk \
    source-native-fs/ simple-xml/
building file list ... done
deleting something
sent 159 bytes received 20 bytes 358.00 bytes/sec
total size is 66 speedup is 0.37
$ grep something simple-xml.xml
0
$ fusermount -u simple-xml
The final rsync invocation uses the --delete-after option to remove the something file, which was originally
part of the XML file but is not part of the source filesystem passed to rsync.
The grep command checks that something is no longer part of the XML file after the sync.
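A check like this is easy to script. The sketch below assumes nothing beyond POSIX grep and mktemp; the function name element_gone and the sample file contents are invented for the example. grep -q prints nothing and only sets the exit status, which makes it convenient in a condition.

```shell
#!/bin/sh
# Succeeds when NAME no longer appears anywhere in FILE.
element_gone() {
    name=$1
    file=$2
    # -F: fixed string, -q: quiet, exit status only
    ! grep -Fq -- "$name" "$file"
}

# Stand-in for the synced XML file (contents invented for this demo).
demo=$(mktemp)
printf '%s\n' 'datefile1.txt' 'main' > "$demo"

if element_gone something "$demo"; then
    echo "something: removed"
fi
if ! element_gone main "$demo"; then
    echo "main: still present"
fi
rm -f "$demo"
```

Against the real file, the call would be element_gone something simple-xml.xml.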
The previous section showed data being synced between a native kernel filesystem (ext3 in this case) and a
subtree in an XML file.
Sync Across Filesystem Types
The libferris and FUSE combination allows you to convert between different data formats while you are
performing the sync. By exposing part of an XML file through libferris and FUSE, you can keep various parts
of an XML file in sync with other data - perhaps involving many different rsync invocations covering
different parts of a single XML file.
The ability to rsync between different filesystems like this can be very convenient when both filesystems
provide different features and you want a combination of these features. For example, many tools make
editing XML simple, though accessing a single element (file) in XML is much slower than accessing a single
file in a db4 file.
The commands shown in Listing 3 keep a db4 file in sync with the contents of an XML file. The simple-xml
FUSE filesystem, which is based on the simple-xml.xml file in Listing 1, is reused here. If there are attributes
in the XML file that are not the standard lstat(2) attributes, they are exposed by the libferris FUSE filesystem
as extended attributes.
Listing 3: Rsyncing an XML File into a db4 File
$ fcreate `pwd` --create-type=db4 name=db4.db
$ mkdir db4
$ ferrisfs -u ~/fuse/db4.db db4
$ rsync -avz --delete-after -T ~/fuse/rsync-junk simple-xml/ db4/
$ db_dump -p db4.db
VERSION=3
format=print
type=btree
db_pagesize=4096
HEADER=END
/atime
1179840317
/datefile1.txt/atime
1179840317
/datefile1.txt/mode
100664
/datefile1.txt/mtime
1179838179
...
datefile1.txt
Tue May 22 22:49:39 EST 2007\0a
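The mode value stored in the db4 file (100664 above) is the familiar st_mode field written in octal. A short sketch for decoding it, assuming a POSIX shell with arithmetic expansion; the function name decode_mode is invented for the example:

```shell
#!/bin/sh
# Split an octal st_mode string, such as 100664, into file type and permissions.
S_IFREG=$((0100000))   # regular file
S_IFDIR=$((0040000))   # directory
S_IFLNK=$((0120000))   # symbolic link

decode_mode() {
    mode=$((0$1))                  # leading 0 makes the shell read it as octal
    perms=$((mode & 0777))         # low nine bits: rwx for user/group/other
    ftype=$((mode & 0170000))      # file type bits (S_IFMT)
    case $ftype in
        "$S_IFREG") kind=file ;;
        "$S_IFDIR") kind=directory ;;
        "$S_IFLNK") kind=symlink ;;
        *)          kind=other ;;
    esac
    printf '%s %o\n' "$kind" "$perms"
}

decode_mode 100664
```

For the value in the listing, this prints "file 664": a regular file that is readable and writable by its owner and group, and readable by everyone else.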
The rsync command can sync extended attributes across filesystems with the -X (--xattrs) command-line
option. A complication is that libferris creates many virtual attributes to expose extra metadata about the
filesystem, and rsync would try to copy all of them.
To get around this extra metadata, the ferrisfs command can limit which attributes the FUSE filesystem
reports. For example, --show-ea=user.dislikes makes the FUSE filesystem report only the user.dislikes
extended attribute. The result is that rsync only tries to sync that one extended attribute instead of all the
other metadata that libferris makes available.
Another complication of syncing extended attributes is that filesystems report user-modifiable attributes
with the user. prefix, so the attribute dislikes is only readable by getxattr(2) under the name
user.dislikes. Because many XML files are unlikely to have the user. prefix on their XML attributes, the
ferrisfs --prepend-user-dot-prefix-to-ea-regex command-line option explicitly adds user. to any attribute
that matches the given regular expression.
Listing 4 shows a first attempt to sync XML attributes as well as file content with ferrisfs and rsync. The first
db_dump execution shows that none of the XML attributes have been written to the Berkeley db4 file. Adding
the rsync -X (--xattrs) command-line option to try to correct this produces an error message about "as-xml" not
being available through getxattr().
Listing 4: Using Rsync to Sync XML Attributes
$ fcreate `pwd` --create-type=db4 name=target.db
$ mkdir target
$ ferrisfs -u `pwd`/target.db target
$ cat attributes-in-xml.xml
[XML markup lost]
$ mkdir attributes-in-xml
$ ferrisfs -u `pwd`/attributes-in-xml.xml/main \
    attributes-in-xml
$ rsync -avz --delete-after -T ~/fuse/rsync-junk \
    attributes-in-xml/ target/
$ db_dump -p target.db
VERSION=3
...
HEADER=END
gaw
sub1
DATA=END
$ rsync -X -avz --delete-after -T ~/fuse/rsync-junk \
    attributes-in-xml/ target/
...building file list ...
rsync: rsync_xal_get: lgetxattr(".","as-xml",37199)
failed: Input/output error (5)
...
$ db_dump -p target.db
VERSION=3
...
HEADER=END
gaw
sub1
DATA=END
$ fusermount -u attributes-in-xml
$ ferrisfs -u `pwd`/attributes-in-xml.xml/main \
    --show-ea-regex="(attr1|another|second)" \
    --prepend-user-dot-prefix-to-ea-regex=".*" \
    attributes-in-xml
$ rsync -X -avz --delete-after -T ~/fuse/rsync-junk \
    attributes-in-xml/ target/
$ db_dump -p target.db
...
HEADER=END
/gaw/user.another
value
/sub1/user.attr1
hello
/sub1/user.second
world
gaw
sub1
DATA=END
The trick is to use the ferrisfs --show-ea-regex and --prepend-user-dot-prefix-to-ea-regex options to show
only the extended attributes you are interested in. If an attribute that matches --show-ea-regex is available
for a virtual libferris file, ferrisfs will export that attribute to FUSE as an extended attribute. As the final
db_dump shows, the XML attributes are now available in the db4 file as well.
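To get a feel for what the two regular expressions select, the same filtering can be mimicked with standard tools. This is only an illustration of the idea, not how ferrisfs implements it: the function name export_ea is invented, and treating both expressions as anchored extended regular expressions is an assumption.

```shell
#!/bin/sh
# Mimic the two options on a list of attribute names read from stdin:
#   $1 = regex of attributes to show, $2 = regex of attributes to prefix.
# Both are treated as anchored extended regexes (an assumption).
export_ea() {
    show=$1
    prefix=$2
    grep -E -x -- "$show" | while IFS= read -r name; do
        if printf '%s\n' "$name" | grep -E -q -x -- "$prefix"; then
            printf 'user.%s\n' "$name"
        else
            printf '%s\n' "$name"
        fi
    done
}

# Attribute names taken from Listing 4, plus the troublesome as-xml.
printf '%s\n' attr1 second another as-xml |
    export_ea '(attr1|another|second)' '.*'
```

The three selected names come out as user.attr1, user.second, and user.another, while as-xml is filtered away, which matches the keys that appear in the final db_dump of Listing 4.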
Listing 5 shows a simple table in a PostgreSQL database. The table can be mounted by using a postgresql:// or
pg:// URL in libferris, as the ferrisls command shows. Using a PostgreSQL table as the source for rsync
presents no new issues with how to invoke ferrisfs, as shown in Listing 6. Each column in the table becomes
an extended attribute in the target filesystem.
When the file contents of a tuple are read through libferris, the result is an XML-serialized version of the
data. Because the extended attributes give the same information broken down by column, you don't really care
about the tuple's file content. Listing 6 solves this issue by reporting that all the tuples are zero-byte files.
Listing 5: Accessing a PostgreSQL Database
$ psql ferristester
ferristester=> \d foobar
          Table "public.foobar"
 Column  |          Type          | Modifiers
---------+------------------------+-----------
 fooid   | integer                | not null
 fooname | character varying(100) |
 e       | character varying(100) |
Indexes:
    "foobar_pkey" PRIMARY KEY, btree (fooid)
ferristester=> select * from foobar;
 fooid | fooname |           e
-------+---------+-----------------------
    10 | William |
    45 | Rick    | 15 credibility street
  3002 | Satou   | Tokyo
   101 | John    | Some data
(4 rows)
ferristester=> \q
$ ferrisls --xml pg://localhost/ferristester/foobar
[XML markup partly lost; surviving fragments:]
name="foobar" primary-key="fooid" ...
url="pg:///localhost/ferristester/foobar">
fooname="William" name="10".../>
fooname="Satou" name="3002".../>
...

Listing 6: Rsyncing Data Out of a Table
$ mkdir pg
$ ferrisfs --show-ea=user.fooid,user.fooname,user.e \
    --prepend-user-dot-prefix-to-ea-regex=".*" \
    --force-empty-file-contents-regex=".*" \
    -u pg://localhost/ferristester/foobar pg
$ ls -l pg
total 0
-rwx------ 0 ferristester ferristester 50 Jan 1 1970 10
-rwx------ 0 ferristester ferristester 57 Jan 1 1970 101
-rwx------ 0 ferristester ferristester 55 Jan 1 1970 3002
-rwx------ 0 ferristester ferristester 68 Jan 1 1970 45
$ cd pg
$ attr -l 101
Attribute "fooid" has a 3 byte value for 101
Attribute "fooname" has a 4 byte value for 101
Attribute "e" has a 9 byte value for 101
$ attr -g fooname 101
Attribute "fooname" had a 4 byte value for 101:
John
$ cd ..
$ mkdir target
$ rsync -Cavz -X -T ~/fuse/rsync-junk pg/ target/
building file list ... done
./
10
101
3002
45
7
sent 762 bytes received 136 bytes 1796.00 bytes/sec
total size is 0 speedup is 0.00
$ cd target
$ attr -l 3002
Attribute "e" has a 5 byte value for 3002
Attribute "fooid" has a 4 byte value for 3002
Attribute "fooname" has a 5 byte value for 3002
$ attr -g e 3002
Attribute "e" had a 5 byte value for 3002:
Tokyo
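The output of attr -l is meant for people, but it is regular enough to parse when a script needs the attribute names. A sketch using sed, with the sample text copied from Listing 6; the function name ea_names is invented here, and the pattern only covers the line format shown in the listing.

```shell
#!/bin/sh
# Print "name length" for each line of attr -l output read from stdin.
ea_names() {
    sed -n 's/^Attribute "\([^"]*\)" has a \([0-9][0-9]*\) byte value for .*$/\1 \2/p'
}

ea_names <<'EOF'
Attribute "e" has a 5 byte value for 3002
Attribute "fooid" has a 4 byte value for 3002
Attribute "fooname" has a 5 byte value for 3002
EOF
```

On a mounted table, the same function could be fed directly, as in attr -l 3002 | ea_names.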
Syncing into PostgreSQL
Synchronizing information into a PostgreSQL database with rsync presents extra issues because a database
table does not behave exactly like a filesystem. For example, as shown in Listing 5, the primary key of the
table is fooid. Without specifying at least the primary key of the tuple to create, you cannot make a new file in
a mounted PostgreSQL table.
Also, when the file contents of a tuple are read through libferris, the result is an XML-serialized version of
the tuple itself. Updating both the XML-serialized version of a tuple and each individual table column through
the extended attributes would be twice the effort. The --throw-away-write-to-file-contents-regex command-line
option to ferrisfs solves the latter problem by ignoring anything that is written to the file's contents for files
that have a URL matching the given regular expression. Updates must happen via the extended attributes
interface.
The --delay-commit-path ferrisfs command-line option was added to solve the primary key issue. The
nominated path allows new files to be created and extended attributes written on those new files without
immediately trying to update the database. Listing 7 shows how to rsync into a PostgreSQL table.
Listing 7: Rsyncing into a PostgreSQL Table
$ ferrisfs --show-ea=user.fooname,user.e,user.fooid \
    --prepend-user-dot-prefix-to-ea-regex=".*" \
    --throw-away-write-to-file-contents-regex=".*" \
    --delay-commit-path=pg:///localhost/ferristester/foobar \
    --delay-commit-path-trigger-ea=user.fooname \
    --throw-away-write-to-ea-regex=".*foobar" \
    -u pg://localhost/ferristester/foobar pg
$ rsync -avz -X -T ~/fuse/rsync-junk target/ pg/
building file list ... done
10
101
3002
45
7
sent 756 bytes received 130 bytes 590.67 bytes/sec
total size is 0 speedup is 0.00
$ cd target
$ ll
total 28K
-rwx------ 1 ferristester ferristester 50 Jan 1 1970 10*
-rwx------ 1 ferristester ferristester 68 Jan 1 1970 45*
-rwx------ 1 ferristester ferristester 57 Jan 1 1970 101*
-rwx------ 1 ferristester ferristester 55 Jan 1 1970 3002*
$ attr -g fooname 10
Attribute "fooname" had a 7 byte value for 10:
William
$ attr -s fooname -V "Willie" 10
Attribute "fooname" set to a 6 byte value for 10:
Willie
$ touch 7
$ attr -s fooid -V 7 7
Attribute "fooid" set to a 1 byte value for 7:
7
$ attr -s fooname -V new-item 7
Attribute "fooname" set to a 8 byte value for 7:
new-item
$ cd ..
$ rsync -avz -X -T ~/fuse/rsync-junk target/ pg/
The commands shown in Listing 8 create a second table and then populate it from foobar using rsync. If the
commands from the mkdir command down are run again at a later time, then foo2 is updated using rsync with
changes from the foobar table.
Listing 8: Keeping a Copy of a PostgreSQL Table
$ psql ferristester
ferristester=> create table foo2
  ( fooid serial primary key,
    fooname varchar(100),
    e varchar(100));
ferristester=> \q
$ mkdir -p foo2
$ ferrisfs --show-ea=user.fooname,user.e,user.fooid \
    --prepend-user-dot-prefix-to-ea-regex=".*" \
    --force-empty-file-contents-regex=".*" \
    --force-empty-read-from-ea-regex=".*foobar" \
    -u pg://localhost/ferristester/foobar pg
$ ferrisfs --show-ea=user.fooname,user.e,user.fooid \
    --prepend-user-dot-prefix-to-ea-regex=".*" \
    --throw-away-write-to-file-contents-regex=".*" \
    --delay-commit-path=pg:///localhost/ferristester/foo2 \
    --delay-commit-path-trigger-ea=user.fooname \
    --throw-away-write-to-ea-regex=".*foo2" \
    -u pg://localhost/ferristester/foo2 foo2
$ rsync -avz -X -T ~/fuse/rsync-junk pg/ foo2/
$ fusermount -u pg
$ fusermount -u foo2
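Because these commands are meant to be rerun, it can help to wrap the sync and unmount steps so that the FUSE mounts are released even when rsync fails. The following is a sketch under assumptions: the function names run and sync_mounted and the DRY_RUN variable are invented, the two ferrisfs mounts are expected to be set up as in Listing 8 before the call, and only the rsync and fusermount invocations from the listing are used.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sync SRC into DST (both FUSE mount points), then unmount both,
# whether or not rsync succeeded. Returns rsync's exit status.
sync_mounted() {
    src=$1
    dst=$2
    tmp=$3
    run rsync -avz -X -T "$tmp" "$src/" "$dst/"
    status=$?
    run fusermount -u "$src"
    run fusermount -u "$dst"
    return "$status"
}
```

With DRY_RUN=1, a call such as sync_mounted pg foo2 ~/fuse/rsync-junk only prints the three commands, which is a cheap way to check the arguments before touching the database.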
Future Directions
Support for rsync with PostgreSQL currently revolves around single tables. In the future, this support should
expand to allow rsync to operate on an entire database at once.
Also, adding support for other syncing solutions like Unison [5] and Harmony [6] will be very interesting.
INFO
[1] libferris: http://witme.sourceforge.net/libferris.web/
[2] rsync: http://rsync.samba.org/
[3] Filesystem in Userspace: http://fuse.sourceforge.net/
[4] fuselagefs and delegatefs:
http://sourceforge.net/project/showfiles.php?group_id=16036&package_id=225200
[5] Unison bidirectional sync: http://www.cis.upenn.edu/~bcpierce/unison/
[6] Harmony bidirectional sync: http://www.seas.upenn.edu/~harmony/
THE AUTHOR
Ben Martin has been working on filesystems for more than 10 years. He is currently working toward a PhD.
His research focuses on combining semantic filesystems with formal concept analysis to improve
human-filesystem interaction.