REMOTE SCHEMA
Server File Journal - stores all changes (see the sketch below). Columns:
- Namespace ID (NSID)
- Relative path in namespace
- Journal ID (JID): monotonically increasing within a namespace
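A rough sketch of one journal row as a Rust struct; the field names (and the inclusion of chunk hashes, which the list use-case below implies) are assumptions:

struct JournalEntry {
    // Namespace the change belongs to.
    nsid: u64,
    // Path relative to the namespace root.
    path: String,
    // Monotonically increasing within a namespace.
    jid: u64,
    // Hashes of the chunks making up this version of the file.
    chunk_hashes: Vec<String>,
}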
BlockServer - can store or retrieve a block
- RocksDB might work
Q:
- where to store chunks? S3 is too expensive for such small files; maybe a cheap distributed key/value DB?
LOCAL DB SCHEMA
files
- jid: integer
- path // relative to current dir
- format: text|binary
- modified: unix timestamp
- size: integer
- is_symlink: bool
- checksum: varchar
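A sketch of creating this table, assuming SQLite via the rusqlite crate (the exact column types are a guess from the schema above):

use rusqlite::{Connection, Result};

fn init_db(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    // jid is nullable: a row without a JID has not been committed yet
    // (see SYNCER below).
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS files (
            jid        INTEGER,
            path       TEXT NOT NULL,     -- relative to current dir
            format     TEXT NOT NULL,     -- 'text' | 'binary'
            modified   INTEGER NOT NULL,  -- unix timestamp
            size       INTEGER NOT NULL,
            is_symlink INTEGER NOT NULL,  -- bool as 0/1
            checksum   VARCHAR
        );",
    )?;
    Ok(conn)
}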
USE-CASES
- client needs to update a file from meta server (MS)
  - during polling, S learns that file /path/bla was updated
  - S sends a list request, passing the namespace and its current cursor
  - MS returns all JIDs since the passed one, with their hashes (maybe when the same file was updated multiple times, return only the last one?)
  - S downloads the chunks it is missing and reconstructs the file
- client needs to upload a file to server
  - S tries to commit the current file it has: commit(/path/bla, [h1,h2,h3])
  - MS returns back the list of hashes it is missing; S uploads those chunks and retries the commit
- program just starts
  - S checks the latest journal_id
  - if the local latest journal_id is the same, it does nothing
  - if the local latest journal_id differs, it syncs (see the sketch after this list)
- file was removed locally
- file was moved locally
- file was renamed
- one line in a file was edited
- one line in a file was added
- one line in a file was removed
If the latest remote JID is bigger, sync by downloading from the remote. If local metadata or size differs, upload to the remote and, after the commit, store the new JID into the local DB.
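A sketch of that startup decision; remote_latest_jid, download_from_remote, and upload_changed_files are hypothetical helpers, stubbed here:

use std::io;

// Hypothetical helpers, stubbed for the sketch.
fn remote_latest_jid() -> io::Result<u64> { Ok(0) }
fn download_from_remote(_since_jid: u64) -> io::Result<()> { Ok(()) }
fn upload_changed_files() -> io::Result<()> { Ok(()) }

fn sync_on_start(local_jid: u64) -> io::Result<()> {
    let remote_jid = remote_latest_jid()?;
    if remote_jid > local_jid {
        // Remote is ahead: pull everything since our cursor.
        download_from_remote(local_jid)?;
    }
    // Files whose metadata/size differ from the local DB get uploaded;
    // after the commit, the new JID is stored in the local DB.
    upload_changed_files()?;
    Ok(())
}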
Q:
- do I need a hierarchy of services, or should they all be independent?
- how should sharing work?
- how to thread it? multiple modules and multiple files
- do I need to sync file metadata as well?
We have separate threads for sniffing the file system, hashing, commit, store_batch, list, retrieve_batch, and reconstruct, allowing us to pipeline-parallelize this process across many files. We use compression and rsync to minimize the size of store_batch/retrieve_batch requests.
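A minimal sketch of that pipelining with std channels, using just two stages; the real pipeline would chain commit, store_batch, etc. the same way:

use std::sync::mpsc;
use std::thread;

fn main() {
    let (path_tx, path_rx) = mpsc::channel::<String>();
    let (hash_tx, hash_rx) = mpsc::channel::<(String, u64)>();

    // Stage 1: "sniffing" the file system (hard-coded paths here).
    thread::spawn(move || {
        for p in ["a.cook", "b.cook"] {
            path_tx.send(p.to_string()).unwrap();
        }
    });

    // Stage 2: hashing, overlapping with stage 1 across files.
    thread::spawn(move || {
        for path in path_rx {
            let fake_hash = path.len() as u64; // placeholder, not a real hash
            hash_tx.send((path, fake_hash)).unwrap();
        }
    });

    // Downstream stages (commit, store_batch, ...) would consume hash_rx.
    for (path, hash) in hash_rx {
        println!("{path} -> {hash}");
    }
}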
SYNCER
- checks if the database has rows without an assigned JID
- when it finds an unassigned JID, it tries to commit; after committing, it updates the local DB with the new JID (see the sketch below)
- if a chunk is not present locally, it tries to download it
- if a chunk is not present remotely, it tries to upload it
commit("breakfast/Mexican Style Burrito.cook", "h1,h2,h3");
Q:
- problem if chunking by line? => seek won't work; need to store block sizes to do the seek effectively.
- where to store chunks for a not-yet-assembled file?
- how to understand that a new file was created remotely?
- how to understand that a file was deleted?
- how to understand that
INDEXER
- syncs files with the local DB on a schedule (e.g., once a minute); a sketch follows below
- watches changes and triggers sync
- will clean up the DB once a day
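A sketch of the scheduled part; reindex is a hypothetical step that walks the watched directory and updates rows whose mtime/size differ from the DB:

use std::thread;
use std::time::Duration;

// Hypothetical reindex step, stubbed for the sketch.
fn reindex() { /* walk files, diff against local DB, update rows */ }

fn run_indexer() {
    loop {
        reindex();
        // Once a minute, per the schedule above.
        thread::sleep(Duration::from_secs(60));
    }
}

The "watches changes" part could sit on top of an fs-notification crate such as notify, with this polling loop as the fallback.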
Q:
- do I need to copy unchanged JIDs, or only update the updated ones? => it makes sense to update all
- what happens on delete, move?
CHUNKER
The role of the Chunker is to deal with persistence of hashes and files. It operates on text files; chunks are not fixed-size, each chunk is one line of the file.
- given a path, it produces the list of hashes of the file:
fn hashify(file_path: String) -> io::Result<Vec<String>>
- given a path and a list of hashes, it saves a new version of the file; it should raise an error if the cache doesn't have content for a specific chunk hash:
fn save(file_path: String, hashes: Vec<String>) -> io::Result<()>
- can read the content of a specific chunk from the cache:
fn read_chunk(chunk: String) -> io::Result<String>
- can write the content of a specific chunk to the cache:
fn save_chunk(chunk: String, content: String) -> io::Result<()>
- given two vectors of hashes, it can check whether they are the same:
fn compare_sets(left: Vec<String>, right: Vec<String>) -> bool
- given a hash, it can check whether the cache contains content for it:
fn check_chunk(chunk: String) -> io::Result<bool>
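A sketch of hashify and compare_sets under those signatures; SHA-256 via the sha2 crate is only an assumption, since the hash function is still an open question (see Q below):

use std::fs;
use std::io;
use sha2::{Digest, Sha256};

// Split a text file into line-chunks and return one hash per line.
fn hashify(file_path: String) -> io::Result<Vec<String>> {
    let content = fs::read_to_string(&file_path)?;
    Ok(content
        .lines()
        .map(|line| {
            // Hex-encode the digest so hashes are plain strings.
            Sha256::digest(line.as_bytes())
                .iter()
                .map(|b| format!("{b:02x}"))
                .collect()
        })
        .collect())
}

// Two hash lists describe the same file version iff they match exactly, in order.
fn compare_sets(left: Vec<String>, right: Vec<String>) -> bool {
    left == right
}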
Q:
- strings will be short, 80-100 symbols. What should be used as the hashing function? What size should the hash be? I'd say square root of 10. You can test it!
- empty files should be different from deleted ones
TODO
- bundling of uploads/downloads
- read-only
- namespaces
- proper error handling
- report an error on unexpected cache behaviour
- don't need to throw an unknown error in each non-200 response
- remove clone
- limit max file size
- configuration struct
- pull changes first or reindex locally first? research possible conflict scenarios
- extract shared data structures to core
- garbage collection on DB
- test test test
- metrics for monitoring (cache saturation, misses)
- protect from DDoS: https://github.com/rousan/multer-rs/blob/master/examples/prevent_dos_attack.rs
- auto-update client
OPEN SOURCING
- how to keep it available as open source (one user?)
- add documentation
- draw data-flow