rs-natural
==========

[![Build Status](https://travis-ci.org/cjqed/rs-natural.svg?branch=master)](https://travis-ci.org/cjqed/rs-natural)

Natural language processing library written in Rust. Still very much a work in progress. Basically an experiment, but hey, maybe something cool will come out of it.

Currently working:

* Jaro-Winkler Distance
* Levenshtein Distance
* Tokenizing
* NGrams (with and without padding) 
* Phonetics (Soundex)
* Stemming (using a fork of [rust-stem](https://github.com/mrordinaire/rust-stem))
* Naive-Bayes classification
* Term Frequency-Inverse Document Frequency (tf-idf)
 
Near-term goals:

* Logistic regression classification
* Optimize naive-bayes (currently pretty slow)
* Plural/Singular inflector

## How to use ##


Use at your own risk. Some functionality is missing, and some of it is slow as molasses because it isn't optimized yet. I'm targeting master and don't offer backward compatibility.

### Setup ###

It's a regular crate with a Cargo.toml. Add this to your Cargo.toml:

```toml
[dependencies]
natural = "0.2.1"
```
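
Then declare the crate in your crate root (the snippets in this README use the pre-2018 `extern crate` style):

```rust
// main.rs or lib.rs
extern crate natural;
```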

### Distance ###


```rust
extern crate natural;
use natural::distance::jaro_winkler_distance;
use natural::distance::levenshtein_distance;

assert_eq!(levenshtein_distance("kitten", "sitting"), 3);
assert_eq!(jaro_winkler_distance("dixon", "dicksonx"), 0.767);
```

Note: don't actually `assert_eq!` on the Jaro-Winkler distance, since it returns an `f64`. To test, I actually use:

```rust
fn f64_eq(a: f64, b: f64) {
  assert!((a - b).abs() < 0.01);
}
```
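
With that helper, the Jaro-Winkler check from the distance example can be written as:

```rust
f64_eq(jaro_winkler_distance("dixon", "dicksonx"), 0.767);
```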

### Phonetics ###


There are two ways to use the Soundex algorithm in this library: either through a simple `soundex` function that accepts two `&str` parameters and returns a boolean, or through the `SoundexWord` struct. I will show both here. Soundex compares words by how they sound in English; for example, "rupert" and "robert" both encode to R163, so they are considered to sound alike.

```rust
extern crate natural;
use natural::phonetics::soundex;
use natural::phonetics::SoundexWord;

assert!(soundex("rupert", "robert"));

let s1 = SoundexWord::new("rupert");
let s2 = SoundexWord::new("robert");
assert!(s1.sounds_like(s2));
assert!(s1.sounds_like_str("robert"));
```

### Tokenization ###


```rust
extern crate natural;
use natural::tokenize::tokenize;

assert_eq!(tokenize("hello, world!"), vec!["hello", "world"]);
assert_eq!(tokenize("My dog has fleas."), vec!["My", "dog", "has", "fleas"]);

```

### NGrams ###


You can create n-grams with and without padding, e.g.:

```rust
extern crate natural;

use natural::ngram::get_ngram;
use natural::ngram::get_ngram_with_padding;

assert_eq!(get_ngram("hello my darling", 2), vec![vec!["hello", "my"], vec!["my", "darling"]]);

assert_eq!(get_ngram_with_padding("my fleas", 2, "----"), vec![
  vec!["----", "my"], vec!["my", "fleas"], vec!["fleas", "----"]]);
```

### Classification ###


```rust
extern crate natural;
use natural::classifier::NaiveBayesClassifier;

let mut nbc = NaiveBayesClassifier::new();

nbc.train(STRING_TO_TRAIN, LABEL);
nbc.train(STRING_TO_TRAIN, LABEL);
nbc.train(STRING_TO_TRAIN, LABEL);
nbc.train(STRING_TO_TRAIN, LABEL);

nbc.guess(STRING_TO_GUESS); // returns the label with the highest probability
```
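
For example, here is a minimal sketch with made-up training strings and labels (the documents below are placeholders of my own, and it assumes `guess` returns the winning label as a `String`):

```rust
extern crate natural;
use natural::classifier::NaiveBayesClassifier;

let mut nbc = NaiveBayesClassifier::new();

// Train on a few toy documents, each tagged with a label.
nbc.train("great fast friendly service", "positive");
nbc.train("loved the food, will come back", "positive");
nbc.train("slow rude staff and cold food", "negative");
nbc.train("terrible experience, never again", "negative");

// guess() picks the label with the highest probability for the new text.
let label = nbc.guess("friendly staff and great food");
println!("{}", label); // most likely "positive"
```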

### Tf-Idf ###


```rust
extern crate natural;
use natural::tf_idf::TfIdf;

let mut tf_idf = TfIdf::new();

tf_idf.add("this document is about rust.");
tf_idf.add("this document is about erlang.");
tf_idf.add("this document is about erlang and rust.");
tf_idf.add("this document is about rust. it has rust examples");

println!("{}", tf_idf.get("rust"));   // 0.2993708f32
println!("{}", tf_idf.get("erlang")); // 0.13782766f32

// for multiple terms, get() returns the average of the individual scores
println!("{}", tf_idf.get("rust erlang")); // 0.21859923
```