codesimian
Class NaturalLanguage

java.lang.Object
  extended by codesimian.NaturalLanguage

public class NaturalLanguage
extends java.lang.Object

Converts strings of natural language, like this sentence, to arrays of numbers so computers can understand them better.


Nested Class Summary
static class NaturalLanguage.OnlyTheMostCommonSymbolsOnKeyboard
           
static class NaturalLanguage.RemoveEverythingExceptLetters
           
static interface NaturalLanguage.StringTransform
           
 
Field Summary
protected  java.lang.String delimiters
           
protected static java.util.Vector<java.lang.String> exampleText
           
protected  java.util.HashMap<java.lang.String,java.lang.String[]> similarStrings
          Strings that can often be used interchangeably, like "3" with "e" in "b33r" or "beer". Also bigger strings, like "y" with "ies" in "penny" or "pennies".
protected  java.lang.String[] word
          The most common words have the lowest indexes.
protected  int[] wordsFound
           
protected  java.util.HashMap<java.lang.String,java.lang.Integer> wordToIndex
          Key is a word.
 
Constructor Summary
NaturalLanguage(int wordsToUse)
          You must call addExampleText() with some sentences for this class to use in its calculations.
NaturalLanguage(java.lang.String sentences, int wordsToUse)
          Same as NaturalLanguage(int) except calls addExampleText(sentences)
 
Method Summary
static void addExampleText(java.lang.String t)
           
protected  void addSimilarStrings(java.lang.String[] similarToEachOther)
          Adds every string in the array to similarStrings, each as its own key, with the whole array as the value.
 void calculateNewMostCommonWords()
          Fills word[] (and wordToIndex) with the most common formatted words from exampleText.
static double chanceIsCorrectlySpelledWord(java.lang.String possibleWord)
          The first call causes this class to read all of codesimian's inner text files.
static double chanceTextIsNatLang(java.lang.String possibleNaturalLanguage)
          Returns a value between 0 (certainly not natural language) and 1 (certainly natural language).
static java.lang.String endWithPunctuation(java.lang.String s, char punctuation)
          Replaces whatever punctuation, if any, the String ends with, or adds punctuation at the end.
 java.lang.String formatWord(java.lang.String word)
          Formats a word in a standard way.
 java.lang.String getDelimiters()
           
 int getIndex(java.lang.String theWord)
          Returns the index of a word: wordToIndex.get(word).intValue().
 java.lang.String getWord(int index)
          Returns word[index].
 boolean setDelimiters(java.lang.String delimiters)
           
 int[] tokenize(java.lang.String sentences)
          Returns a sequence of ints representing the most common words (as learned from the example text) that occur in 'sentences'.
 int wordCount()
          Returns how many unique words are recognized, word.length.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordToIndex

protected java.util.HashMap<java.lang.String,java.lang.Integer> wordToIndex
Key is a word. Value is an Integer index into the word array.
For example, word[ wordToIndex.get("is").intValue() ] equals "is".


word

protected java.lang.String[] word
The most common words have the lowest indexes.


wordsFound

protected int[] wordsFound

delimiters

protected java.lang.String delimiters

similarStrings

protected java.util.HashMap<java.lang.String,java.lang.String[]> similarStrings
Strings that can often be used interchangeably, like "3" with "e" in "b33r" or "beer". Also bigger strings, like "y" with "ies" in "penny" or "pennies". Anywhere one string is found, an AI might think about what the text would mean if parts were replaced by a similar string. To cut down on the number of words, all similar words may be thought of as the same word.

Key is some string that is a key or value in similarStrings.
Value is a String[] array containing the key you searched for and all strings similar to it.


exampleText

protected static java.util.Vector<java.lang.String> exampleText
Constructor Detail

NaturalLanguage

public NaturalLanguage(int wordsToUse)
You must call addExampleText() with some sentences for this class to use in its calculations.

Parameters:
wordsToUse - number of words that are recognized. All other words are ignored and never returned in a sequence.

NaturalLanguage

public NaturalLanguage(java.lang.String sentences,
                       int wordsToUse)
Same as NaturalLanguage(int), except that it also calls addExampleText(sentences).

Method Detail

getIndex

public int getIndex(java.lang.String theWord)
Returns the index of a word: wordToIndex.get(word).intValue(). If the word is not already formatted, call getIndex(formatWord(word)) instead. The most common word has index 0, the second most common has index 1, and so on. Returns -1 if the word is not known and therefore has no index.


getWord

public java.lang.String getWord(int index)
Returns word[index]. The most common word has index 0, the second most common has index 1, and so on.


wordCount

public int wordCount()
Returns how many unique words are recognized, word.length. Indexes range from 0 to wordCount()-1.
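
The relationship between word[], wordToIndex, getIndex(), getWord() and wordCount() can be sketched as follows. This is a minimal illustration that assumes only the behavior described on this page; the class name WordIndexSketch and its constructor are hypothetical, not part of codesimian.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the word/index bookkeeping described above.
class WordIndexSketch {
    private final String[] word;                        // most common word first
    private final Map<String, Integer> wordToIndex = new HashMap<>();

    // Assumption: the caller passes words already sorted, most common first.
    WordIndexSketch(String[] wordsMostCommonFirst) {
        word = wordsMostCommonFirst.clone();
        for (int i = 0; i < word.length; i++) {
            wordToIndex.put(word[i], i);
        }
    }

    // Returns the index of a known word, or -1 if the word has no index.
    int getIndex(String theWord) {
        Integer index = wordToIndex.get(theWord);
        return index == null ? -1 : index;
    }

    String getWord(int index) {
        return word[index];
    }

    int wordCount() {
        return word.length;
    }
}
```

With words {"the", "is", "cat"}, getIndex("the") is 0, getIndex("zebra") is -1, and getWord(getIndex("is")) gives back "is".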


setDelimiters

public boolean setDelimiters(java.lang.String delimiters)

getDelimiters

public java.lang.String getDelimiters()

chanceIsCorrectlySpelledWord

public static double chanceIsCorrectlySpelledWord(java.lang.String possibleWord)
The first call causes this class to read all of codesimian's inner text files.


addSimilarStrings

protected void addSimilarStrings(java.lang.String[] similarToEachOther)
Adds every string in the array to similarStrings, each as its own key, with the whole array as the value. If a String has already been added as a similar string, its value is updated to this new array.
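
A minimal sketch of this bookkeeping, assuming nothing beyond the description above. The class SimilarStringsSketch and the lookup method similarTo() are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the similarStrings map.
class SimilarStringsSketch {
    // Each string maps to the whole group of strings it is similar to.
    private final Map<String, String[]> similarStrings = new HashMap<>();

    // Every member of the group becomes a key; the group array is the value.
    // A string registered earlier is overwritten with the new array.
    void addSimilarStrings(String[] similarToEachOther) {
        for (String s : similarToEachOther) {
            similarStrings.put(s, similarToEachOther);
        }
    }

    // Hypothetical accessor: returns the group for s, or null if s is unknown.
    String[] similarTo(String s) {
        return similarStrings.get(s);
    }
}
```

After addSimilarStrings(new String[] {"3", "e"}), looking up either "3" or "e" yields the same two-element array.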


formatWord

public java.lang.String formatWord(java.lang.String word)
Formats a word in a standard way. Makes the word lower-case and changes most plurals to singular. Assumes the English language. Removes anything that is not a letter or digit. Returns null if the word does not look like a word.
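
The steps described above might look roughly like the sketch below. The plural rules and the "does not look like a word" test are assumptions made for illustration (the real rules are not documented here), and FormatWordSketch is a hypothetical class name.

```java
import java.util.Locale;

// Hypothetical sketch of the formatting steps; not codesimian's actual rules.
class FormatWordSketch {
    static String formatWord(String word) {
        if (word == null) {
            return null;
        }
        // Lower-case, then keep only letters and digits.
        StringBuilder kept = new StringBuilder();
        for (char c : word.toLowerCase(Locale.ENGLISH).toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                kept.append(c);
            }
        }
        String s = kept.toString();
        // Assumption: an empty result means the input was not a word.
        if (s.isEmpty()) {
            return null;
        }
        // Assumption: a deliberately crude English plural rule.
        if (s.length() > 3 && s.endsWith("ies")) {
            s = s.substring(0, s.length() - 3) + "y";   // "pennies" -> "penny"
        } else if (s.length() > 3 && s.endsWith("s") && !s.endsWith("ss")) {
            s = s.substring(0, s.length() - 1);         // "words" -> "word"
        }
        return s;
    }
}
```

A rule this simple mis-handles words such as "this", which is one reason the documentation only promises to change most plurals.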


addExampleText

public static void addExampleText(java.lang.String t)

calculateNewMostCommonWords

public void calculateNewMostCommonWords()
Fills word[] (and wordToIndex) with the most common formatted words from exampleText. Call this once after a lot of text has been entered with addExampleText(), and before calling tokenize().
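
One plausible way to pick the most common words is to count occurrences and sort by count. The sketch below assumes that approach; the splitting rule, the tie-breaking order, and the names MostCommonWordsSketch and mostCommonWords() are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: rank words by how often they occur in the example text.
class MostCommonWordsSketch {
    static String[] mostCommonWords(List<String> exampleText, int wordsToUse) {
        // Count each lower-cased run of letters and digits.
        Map<String, Integer> counts = new HashMap<>();
        for (String text : exampleText) {
            for (String token : text.toLowerCase(Locale.ENGLISH).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        // Highest count first; ties broken alphabetically so the result is deterministic.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Keep at most wordsToUse words; index 0 is the most common.
        int n = Math.min(wordsToUse, entries.size());
        String[] word = new String[n];
        for (int i = 0; i < n; i++) {
            word[i] = entries.get(i).getKey();
        }
        return word;
    }
}
```

The returned array plays the role of word[], so a wordToIndex map can be built from it by mapping each element to its position.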


tokenize

public int[] tokenize(java.lang.String sentences)
Returns a sequence of ints representing the most common words (as learned from the example text) that occur in 'sentences'. Tokenizes and formats the words before checking whether they equal known words.
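
As a rough sketch of this step: split the input on the delimiters, normalize each token, and keep the index of every token that is a known word. The version below assumes that unknown words are skipped (as stated for wordsToUse) and uses lower-casing as a stand-in for formatWord(); TokenizeSketch and its parameters are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical sketch of turning sentences into word indexes.
class TokenizeSketch {
    static int[] tokenize(String sentences, String[] word, String delimiters) {
        // Build the word -> index lookup; index 0 is the most common word.
        Map<String, Integer> wordToIndex = new HashMap<>();
        for (int i = 0; i < word.length; i++) {
            wordToIndex.put(word[i], i);
        }
        StringTokenizer st = new StringTokenizer(sentences, delimiters);
        int[] buffer = new int[st.countTokens()];
        int found = 0;
        while (st.hasMoreTokens()) {
            // Assumption: lower-casing stands in for the real formatWord().
            String token = st.nextToken().toLowerCase(Locale.ENGLISH);
            Integer index = wordToIndex.get(token);
            if (index != null) {
                buffer[found++] = index;   // unknown words are simply skipped
            }
        }
        return Arrays.copyOf(buffer, found);
    }
}
```

With word = {"the", "dog", "cat"} and delimiters " .", the sentence "The cat saw the dog." becomes the sequence 0, 2, 0, 1, because "saw" is not a known word.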


endWithPunctuation

public static java.lang.String endWithPunctuation(java.lang.String s,
                                                  char punctuation)
Replaces whatever punctuation, if any, the String ends with, or adds punctuation at the end.
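
A minimal sketch of that behavior, assuming that all trailing punctuation is stripped and that "punctuation" means the characters . , ; : ! and ? (the real character set is not documented here). EndWithPunctuationSketch is a hypothetical class name.

```java
// Hypothetical sketch; the set of punctuation characters is an assumption.
class EndWithPunctuationSketch {
    static String endWithPunctuation(String s, char punctuation) {
        // Walk back over any punctuation the string already ends with.
        int end = s.length();
        while (end > 0 && isPunctuation(s.charAt(end - 1))) {
            end--;
        }
        // Append the requested punctuation to what is left.
        return s.substring(0, end) + punctuation;
    }

    private static boolean isPunctuation(char c) {
        return ".,;:!?".indexOf(c) >= 0;
    }
}
```

Under these assumptions, endWithPunctuation("Hello!", '.') gives "Hello." and endWithPunctuation("Hello", '?') gives "Hello?".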


chanceTextIsNatLang

public static double chanceTextIsNatLang(java.lang.String possibleNaturalLanguage)
Returns a value between 0 (certainly not natural language) and 1 (certainly natural language).