Questions tagged [n-gram]
An N-gram is an ordered sequence of N elements of the same kind, usually drawn from a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotide bases in DNA, or amino acids in proteins. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.
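As a minimal illustration of the definition above, the following Python sketch slides a window of length N over a sequence and counts the resulting N-grams. The function name `ngrams` and the sample inputs are illustrative only and are not taken from any particular library.

```python
from collections import Counter

def ngrams(items, n):
    """Return all contiguous n-grams of `items` as tuples, in order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word bigrams: the elements are words.
words = "the cat sat on the cat".split()
bigram_counts = Counter(ngrams(words, 2))

# Character trigrams: the elements are letters.
letter_trigrams = ngrams("table", 3)
```

The same function works on any sliceable sequence, which is why one definition covers words, letters, and other element types.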
n-gram
880
questions
0
votes
1
answer
66
views
How to save n-gram output
A hopefully simple question. How can I save the ngram output from the following code?
library("quanteda")
## Package version: 2.1.2
data(data_corpus_inaugural)
toks <- ...
0
votes
1
answer
36
views
letter and bigram composition for each word in the dataframe
I have a data frame with words and I want to extract the letter and bigram composition for each word.
Data:
df$text
[1] "table"
[2] "run"
[3] "mug"
And in the end I ...
1
vote
1
answer
38
views
How do I determine the weight? depending on what?
I'm trying to calculate the n-gram using Python. The weight I used for uni-gram, bi-gram, tri-gram, and 4-gram is (0.25, 0.25, 0, 0).
When I run the script for the first reference it gives me a ...
0
votes
0
answers
18
views
How to calculate the frequency of bigrams on fixed size windows
I am computing the frequency of bigrams given a list of token files tokenized_corpus = ['tokens_A.pickle', 'tokens_B.pickle', ...] where every tokens_X file unpickles as ['x', 'a', 'b', 'a', 'b', 'd', ...
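One possible reading of "fixed size windows" is sketched below in Python, under the assumption that a window is a non-overlapping chunk of `window_size` consecutive tokens. The function name `bigram_counts_per_window` and the parameter `window_size` are hypothetical, and loading the pickled token files is left out; the sample list uses only the six tokens visible in the excerpt.

```python
from collections import Counter

def bigram_counts_per_window(tokens, window_size):
    """Count bigrams separately inside each non-overlapping window of tokens."""
    if window_size < 2:
        raise ValueError("window_size must be at least 2 to contain a bigram")
    counts = []
    for start in range(0, len(tokens), window_size):
        window = tokens[start:start + window_size]
        # zip pairs each token with its successor inside this window only.
        counts.append(Counter(zip(window, window[1:])))
    return counts

tokens = ['x', 'a', 'b', 'a', 'b', 'd']
per_window = bigram_counts_per_window(tokens, 3)
```

In this sketch a bigram that straddles a window boundary is not counted in either window; if overlapping or sliding windows are intended, the slicing step would need to change.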
1
vote
0
answers
32
views
Better performance and results for autocomplete search edge_ngram or search_as_you_type elasticsearch
I was testing and researching the use of edge_ngrams and the search_as_you_type field in Elasticsearch to improve search results, but I see that they are very similar and I would like to know ...
0
votes
0
answers
20
views
How to find pmi and phrase-count for everygrams?
Using NLTK's library I can find metrics about bigrams and trigrams. Now I want to find all the possible phrases and find their occurrence count and PMI score as I did with the bigrams and trigrams like ...
0
votes
0
answers
27
views
Bitextor/Bicleaner MAX_ORDER Issue
I am trying to analyze a translation file (with English-French sentence pairs) using Bicleaner (https://github.com/bitextor/bicleaner). I have a "test corpus" with ten sentence pairs ...
0
votes
0
answers
48
views
String Matching Function Not Matching Strings Despite Threshold Set to 0
I have implemented a string matching function in Python utilizing n-grams and similarity ratios. The function signature is as follows:
# concise version of the function
def match_strings(...
-2
votes
1
answer
52
views
Incorporating Phone Number Matching into Existing String based Name Matching Function
I have a Python function, match_strings, which is designed to match names from two different data sources. Here is the function definition:
def match_strings(strings1, strings2, ngram_n=2, ...
0
votes
0
answers
12
views
Ideal number of <BOS> tags in N-gram Language Model
Let us assume there is a sentence "There is a monkey". Now, let us try to create trigrams after adding Beginning of String and End of String (<BOS>, <EOS>) tags to the string.
...
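The setup described above can be sketched in Python as follows. This assumes one common convention, which the question itself does not state: pad with N-1 `<BOS>` tags so that the first word has a full context, and append a single `<EOS>` tag. The function name `padded_ngrams` and its parameters are hypothetical.

```python
def padded_ngrams(sentence, n=3, bos="<BOS>", eos="<EOS>"):
    """Build n-grams after padding with n-1 BOS tags and one EOS tag."""
    # Assumption: n-1 leading tags; other padding schemes are possible.
    tokens = [bos] * (n - 1) + sentence.split() + [eos]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigrams = padded_ngrams("There is a monkey")
for gram in trigrams:
    print(gram)
```

With this padding the four-word sentence yields five trigrams, one per predicted token (the four words plus `<EOS>`).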
1
vote
1
answer
91
views
How to count char tuples efficiently in PHP
I need to quickly count char tuples (or N-grams) in huge files/strings (from 10MB+ up to 1GB+) within a PHP project (a file classifier).
The current implementation only counts single characters (N=...
0
votes
1
answer
62
views
How does elasticsearch count tf-idf? That looks weird
I have an index with documents that store system information and searchable fields that are copied into the searchable_keys field. In this case, there is only one such field - name.
Here's the definition ...
0
votes
0
answers
26
views
BERTopic n-gram phrases are not adjacent to each other
ngram_range parameter of BERTopic is outputting n-grams with words far away from each other
After setting the ngram_range=(2,2), the trained BERTopic model generates topics with 2-gram phrases such as ...
0
votes
2
answers
212
views
Python IntelliJ style 'search everywhere' algorithm
I have a list of file names in python like this:
HelloWorld.csv
hello_windsor.pdf
some_file_i_need.jpg
san_fransisco.png
Another.file.txt
A file name.rar
I am looking for an IntelliJ style search ...
1
vote
1
answer
135
views
bigram calculation - Memory error, large file problem
Here is the code for bigram calculation from a text corpus:
import sys
import csv
import string
import nltk
from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.util ...