Questions tagged [n-gram]

An N-gram is an ordered sequence of N elements of the same kind, usually found in a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotides in DNA, and amino acids in proteins. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.
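As a rough illustration of the definition above, here is a minimal Python sketch that extracts contiguous N-grams from any sequence and counts them. The helper name `ngrams` and the sample sentence are assumptions made for this example; they are not taken from any question listed here, and libraries such as NLTK or quanteda provide their own equivalents.

```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-grams in `items` as a list of tuples.

    `items` can be a list of words, a string of letters, or any other
    sliceable sequence. Hypothetical helper, for illustration only.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 contiguous n-grams
    # (none at all when it is shorter than n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level bigrams and their frequencies.
words = "the cat sat on the cat".split()
bigrams = ngrams(words, 2)
counts = Counter(bigrams)

# Character-level bigrams of a single word.
letters = ngrams("table", 2)

if __name__ == "__main__":
    print(counts.most_common(3))
    print(letters)
```

The same function covers both word and character N-grams because it only relies on slicing; padding with start or end markers, smoothing, and weighting are deliberately left out.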

0 votes
1 answer
66 views

How to save n-gram output

A hopefully simple question. How can I save the ngram output from the following code? library("quanteda") ## Package version: 2.1.2 data(data_corpus_inaugural) toks <- ...
bgreen's user avatar
  • 65
0 votes
1 answer
36 views

letter and bigram composition for each word in the dataframe

I have a data frame with words and I want to extract the letter and bigram composition for each word. Data: df$text [1] "table" [2] "run" [3] "mug" And in the end I ...
Oksana Ts.'s user avatar
1 vote
1 answer
38 views

How do I determine the weight? depending on what?

I'm trying to calculate the n-gram using Python. The weight I used for uni-gram, bi-gram, tri-gram, and 4-gram is (0.25, 0.25, 0, 0). When I run the script for the first reference it gives me a ...
user20003920's user avatar
0 votes
0 answers
18 views

How to calculate the frequency of bigrams on fixed size windows

I am computing the frequency of bigrams given a list of token files tokenized_corpus = ['tokens_A.pickle', 'tokens_B.pickle', ...] where every tokens_X file unpickles as ['x', 'a', 'b', 'a', 'b', 'd', ...
Mustafa's user avatar
  • 39
1 vote
0 answers
32 views

Better performance and results for autocomplete search edge_ngram or search_as_you_type elasticsearch

I was testing and researching the use of edge_ngrams and the search_as_you_type field in Elasticsearch to improve search results, but I see that they are very similar and I would like to know ...
Andry Hernandez's user avatar
0 votes
0 answers
20 views

How to find pmi and phrase-count for everygrams?

Using NLTK's library I can find metrics about bigrams and trigrams. Now I want to find all the possible phrases and find their occurrence count and PMI score as I did with the bigrams and trigrams like ...
98fly's user avatar
  • 31
0 votes
0 answers
27 views

Bitextor/Bicleaner MAX_ORDER Issue

I am trying to analyze a translation file (with English-French sentence pairs) using Bicleaner (https://github.com/bitextor/bicleaner). I have a "test corpus" with ten sentence pairs ...
DevNoob_21's user avatar
0 votes
0 answers
48 views

String Matching Function Not Matching Strings Despite Threshold Set to 0

I have implemented a string matching function in Python utilizing n-grams and similarity ratios. The function signature is as follows: # concise version of the function def match_strings(...
NIDHI SHASTRY's user avatar
-2 votes
1 answer
52 views

Incorporating Phone Number Matching into Existing String based Name Matching Function

I have a Python function, match_strings, which is designed to match names from two different data sources. Here is the function definition: python def match_strings(strings1, strings2, ngram_n=2, ...
Rahul T's user avatar
0 votes
0 answers
12 views

Ideal number of <BOS> tags in N-gram Language Model

Let us assume there is a sentence "There is a monkey". Now, let us try to create trigrams after appending Beginning of String and End of String (<BOS>, <EOS>) tags to the string. ...
Anant Kumar's user avatar
1 vote
1 answer
91 views

How to count char tuples efficiently in PHP

I need to quickly count char tuples (or N-grams) in huge files/strings (from 10MB+ up to 1GB+) within a PHP project (a file classifier). The current implementation is made for single character counts (N=...
Crypto's user avatar
  • 191
0 votes
1 answer
62 views

How does elasticsearch count tf-idf? That looks weird

I have an index with documents that store system information and searchable fields that are copied into the searchable_keys field. In this case, there is only one such field - name. Here's the definition ...
Prosto_Oleg's user avatar
0 votes
0 answers
26 views

BERTopic n-gram phrases are not adjacent to each other

The ngram_range parameter of BERTopic is outputting n-grams with words far away from each other. After setting ngram_range=(2,2), the trained BERTopic model generates topics with 2-gram phrases such as ...
David's user avatar
  • 1
0 votes
2 answers
212 views

Python IntelliJ style 'search everywhere' algorithm

I have a list of file names in Python like this: HelloWorld.csv hello_windsor.pdf some_file_i_need.jpg san_fransisco.png Another.file.txt A file name.rar I am looking for an IntelliJ style search ...
Adam Griffiths's user avatar
1 vote
1 answer
135 views

bigram calculation - Memory error, large file problem

Here is the code for bigram calculation from the text corpus: import sys import csv import string import nltk from nltk import word_tokenize from nltk.tokenize import RegexpTokenizer from nltk.util ...
XTRUST.ORG's user avatar
  • 3,374
