kaggle-Real or Not? NLP with Disaster Tweets ①

Posted by youmin park on 2020-11-23

Real or Not? NLP with Disaster Tweets

NLP Getting Started Tutorial

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
train_df = pd.read_csv('/content/drive/MyDrive/kaggle/tweetDisater/data/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/kaggle/tweetDisater/data/test.csv')

An example of a tweet that is not about a disaster

train_df[train_df["target"]==0]["text"].values[1]
'I love fruits'

An example of a tweet that is about a disaster

train_df[train_df["target"]==1]["text"].values[1]
'Forest fire near La Ronge Sask. Canada'
CountVectorizer
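Vectorizes a collection of documents by the counts (frequency of occurrence) of their words.
The result is a count matrix, also known as a term-document matrix (TDM).
By default all text is converted to lowercase, so me and Me become the same feature.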
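A minimal sketch of the counting and lowercasing behaviour described above. The two sentences are made up for illustration and are not taken from the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (illustrative only, not from the dataset)
docs = ["Me and me", "Fire near me"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each lowercased token to its column index
vocab = vectorizer.vocabulary_
print(sorted(vocab))     # ['and', 'fire', 'me', 'near']
print(counts.toarray())  # row 0 counts "me" twice because "Me" was lowercased
```

Each row is one document and each column is one token, so "Me" and "me" end up in the same column.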
count_vectorizer = feature_extraction.text.CountVectorizer()

# Count the tokens in the first five tweets
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())
(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]
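  1. The first 5 tweets contain 54 unique words (tokens) in total.
  2. The first tweet contains only a few of those 54 unique words. Each non-zero entry marks a word (token) that appears in the first tweet; each 0 marks a word that does not.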
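To see which tokens the non-zero entries correspond to, the column indices can be mapped back through the vectorizer's vocabulary. The sketch below uses three made-up sentences instead of the competition data, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (illustrative only, not from the dataset)
docs = ["Forest fire near the town", "I love fruits", "Heavy rain and flood"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)

# Invert vocabulary_ (token -> column) into column -> token
index_to_token = {idx: tok for tok, idx in vectorizer.vocabulary_.items()}

# nonzero() returns (row_indices, column_indices) for the sparse row
first_row = vectors[0]
present = sorted(index_to_token[int(col)] for col in first_row.nonzero()[1])
print(present)  # tokens that actually occur in the first document
```

Note that the default tokenizer drops single-character tokens such as "I", so they never get a column.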

Build vectors for all tweets

train_vectors = count_vectorizer.fit_transform(train_df["text"])

test_vectors = count_vectorizer.transform(test_df["text"])
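The test set goes through transform rather than fit_transform, so its columns line up with the vocabulary learned from the training set. A small sketch with made-up sentences (not the competition data) shows the effect:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (illustrative only)
train_docs = ["fire in the forest", "love the fruits"]
test_docs = ["earthquake in the forest"]

vectorizer = CountVectorizer()
train_vecs = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_vecs = vectorizer.transform(test_docs)        # reuses it, learns nothing new

# Both matrices have the same number of columns
print(train_vecs.shape, test_vecs.shape)

# A token seen only in the test data has no column and is silently ignored
print("earthquake" in vectorizer.vocabulary_)
```

Calling fit_transform on the test set instead would build a different vocabulary, and the classifier trained on the training columns could no longer be applied to it.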

clf = linear_model.RidgeClassifier()
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores
array([0.59453669, 0.56498283, 0.64082434])
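With scoring="f1", each of the three values is the F1 score on one fold, that is, the harmonic mean of precision and recall for the positive class. A minimal pure-Python sketch of that formula on made-up labels; the helper name f1_binary is hypothetical and is not part of scikit-learn.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); hypothetical helper for illustration."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are 0, so F1 is 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(round(f1_binary(y_true, y_pred), 3))  # 0.667
```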
clf.fit(train_vectors, train_df["target"])
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)
sample_submisstion = pd.read_csv("/content/drive/MyDrive/kaggle/tweetDisater/data/sample_submission.csv")
sample_submisstion.head()

id target
0 0 0
1 2 0
2 3 0
3 9 0
4 11 0
sample_submisstion["target"] = clf.predict(test_vectors)
sample_submisstion.head()

id target
0 0 0
1 2 1
2 3 1
3 9 0
4 11 1
sample_submisstion.to_csv("submission.csv", index=False)